I think the library is great, and don't have a whole lot to say about it, but wanted to mention one tangentially related thing:
The README on this repo is awesome. Opening up with advantages and disadvantages? Awesome. Plenty of code examples covering all of the major use cases? Awesome. Quick overview of the internals? AWESOME! Quick two line note about how to use the library in your project? Awesome.
I'm tempted to rant about how I wish documentation was taken more seriously, and that programmers seem to make it a point of pride that spending the first half hour with a library figuring out how to actually use it is just something we have to deal with as programmers, but I won't do so aside from this single sentence.
man pages used to be treated with the same reverence, but nowadays nobody seems to bother. I have pretty extensive man pages for some of my open source stuff, but it doesn't seem to help with users downstream.
Don't get discouraged; people who greatly appreciate a well written man page are still around. They're the users you might not hear about because the fine documentation actually helped them. :-)
I used to be quite okay with command line help for small tools, but ever since I discovered http://explainshell.com/ I have a newfound appreciation for man pages.
They almost never have examples (is that some sort of rule?), they have no "quick summary of the shit you use 95% of the time", and they are generally written as a novel, seemingly from the perspective of the developer of the util rather than the consumer.
Mind you, I have used man pages many times before, but only because they were the best source of information, not because they were a particularly efficient one.
The Markdown README of this string library is a thousand times better than any man page I have seen.
man pages start with the function prototypes, that's your "quick summary".
Near the end there can be an examples section, and many man pages have one. If I run "man 3 open" I see a reasonable examples section.
I like manpages because I'm already editing or running programs in the terminal, and I can pop open the man page in the terminal quite quickly and conveniently, without a search online or even a single network request, without reaching for the mouse, etc.
I'm not sure it was ever meant to be intuitive or easy to use. As far as I can tell, INFO pages (and much of UNIX) were intended to be a stop-gap solution until something better and more permanent was written atop them. Unfortunately that dream was never realized, and Linux's accidental popularity standardized what was supposed to be a bunch of building blocks. I could be wrong, but almost every utility in UNIX screams this to me.
INFO is a Richard Stallman thing, it's a GNU thing, it is emphatically NOT a Unix thing.
Stallman wanted to replace the Unix thing (which was always man pages) with INFO, and for many years deprecated man pages, which is part of why many GNU man pages are sub-par.
Not that non-GNU man pages are perfect, but still.
As for Unix being intended to be a stop-gap, your impression is simply historically incorrect, aside from philosophical issues like the claims made in the infamous Gabriel essay "Worse is Better".
> I could be wrong though, but almost every utility in UNIX screams this to me.
Unix/Linux is certainly not perfect, but this simply reflects the truth of Henry Spencer's aphorism, "Those who don't understand Unix are condemned to reinvent it, poorly."
People who think Unix got it all wrong, as opposed to merely having assorted warts, should read Raymond's "Art of Unix Programming".
I was more than a little startled that Raymond captured a lot of the truth of the subject; it's a good read, and can potentially make anyone a better programmer.
I don't have the same understanding. Quite the opposite. GNU texinfo was designed (in 1986) to both generate manuals and be used as a hypertext system.
I know of nothing in its history to suggest that it was a stop-gap system.
Everything in UNIX is supposed to be building blocks, but you're supposed to be able to quickly assemble what you need from them in the shell. I don't think INFO fits this mold particularly well.
I'm still looking for a Linux desktop with decent man-page integration. The worst thing about man pages (and a lot of *nix documentation) is that the means of accessing it seem to be stuck in 1970.
"They almost never have examples(Is that some sort of rule?)"
They often have examples, though this is certainly better represented in some areas than others. One I looked at just the other day, man 7 aio, is 334 lines. Starts out describing the various aio functions and structures under DESCRIPTION, and has an EXAMPLE section: http://man7.org/linux/man-pages/man7/aio.7.html
Which is unfortunate, because the majority are either extremely terse, and/or merely consist of installation instructions, so that the information content is nowhere near as high as an average man page.
At least in my experience; maybe I've just been repeatedly unlucky.
I doubt it, though, because there isn't any standard template for them, unlike the situation for man pages, which have a number of standard sections that people generally copy.
Not doing trivial overflow checking before arithmetic.
Using results from said arithmetic as indices or starting points for memory writes.
My gut says that any application using this library to process untrusted input is exploitable.
Of course, it was never advertised as being secure :-)
Which is unusual, as many "better strings" libraries make claims about the security of the traditional way of string handling (and then go on to do it wrong).
Edit: To give a concrete example, someone here recommended BString just a few days ago. That library has a "security statement" in which it claims to prevent overflow-type attacks by using signed integers instead of size_t, and then checking whether the result is negative after arithmetic. But signed overflow is undefined behavior in C, and these values are not guaranteed to wrap around. And yes, such things have been exploited. I don't know how likely or easy bstring is to "own", but in any case it's doing it wrong.
And yes, I checked the code; it does what it claims to do.
Hello clarry, I don't think there are APIs here that will lead to security issues unless grossly misused. There is one issue I'm aware of that is perhaps not easily exploitable but is surely unexpected behavior: overflow when you reach a 2GB string. That requires a check in sdsMakeRoomFor() for sure.
Note that this could be fixed by using uint64_t type for example in the header, however in the original incarnation of SDS inside Redis this was not possible for memory usage concerns. In the standalone library I believe it is instead a great idea, since 2GB is no longer an acceptable limit.
From the point of view of the security the most concerning function looks like sdsrange(), there are sanity checks in place to avoid that indexes received from the outside can lead to issues, but I'll do a security review in the future in order to formally check for issues.
If it's any reassurance, I had a look at it, and although there is unnecessary signed/unsigned arithmetic, I could not find anything exploitable. Change the types of variables representing an object size, length, array index, and so on to size_t (from int) and it will be fine.
Might be acceptable for an individual project where you can make statements about the sizes that will be used ahead of time, but for a general purpose library your statement is absolutely false. The issue is that your size requirement can overflow size_t, malloc gets passed a smaller value than you really intended, and then if you try to use what you think you allocated (remember, the allocation succeeded, just with a smaller size), it's a dereference outside the bounds of the allocation. Changing the size or type of the length variable won't solve that. This is a common and well known vulnerability pattern - it is disappointing to see folks so dismissive of it.
I'm aware of arithmetic overflow, thanks. It was the subject of my research (which will eventually be open source, watch www.peripetylabs.com/publications/ if you're interested). See my sizeop library in the meantime: https://bitbucket.org/eliteraspberries/sizeop
I am not dismissive of the problem. I just know from experience that the chances of code like that being fixed at all, let alone correctly, is near zero.
I think when doing malloc(n * m) or similar the most cautious thing to do would be to check for overflow even if you don't think it's exploitable. Especially for a library. Witness for example that OpenBSD's calloc does an overflow check.
I often leave this out of my own code, but not without feeling somewhat guilty about it.
In general, just because you cannot exploit something doesn't mean nobody else can. What to you seems at worst "merely" unexpected behavior is a very good starting point for someone who actually writes exploits.
Consider what happens if an attacker controls initlen passed to sdsnewlen() (also via sdsnew() if he controls the initial string). He can overflow the arithmetic on malloc/calloc. Now either the memcpy or the NUL-byte assignment can write out of bounds, which is potentially exploitable, especially if the attacker has a nearly unbounded number of attempts (which they often do as far as online-facing services go) or if they can do things to affect the program's memory layout -- which they likely can if they're feeding the program with data.
But if these two aren't enough for a successful exploit (I wouldn't make any bets!), the initlen is assigned to sh->len (whoops, conversion to signed... and possible overflow), which then can completely throw off the arithmetic in sdsMakeRoomFor. Here I would actually be quite willing to bet an attacker can arrange these numbers so that he can do a more controlled out-of-bounds write in one of the functions that rely on sdsMakeRoomFor.
These two functions aren't necessarily the only problematic ones; they're just the ones I noticed after a minute of scanning.
And I don't really agree with what eliteraspberrie said; using the right types is a good start but it's absolutely not fine until after you actually do all the overflow checking; for an example of this, see the OpenBSD man page for malloc: http://www.openbsd.org/cgi-bin/man.cgi?query=malloc
Rule of thumb: whenever you do any arithmetic, ask yourself, can it overflow (how do you know it doesn't?) and who's in charge of making sure it doesn't? In this case it's your API's responsibility. This is exactly the same thing you do whenever you call a function: you need to know if it can fail, and if it can, you need to check the return value and do the right thing.
In the years I've spent reading CVEs (and the broken code that caused the alert), proof-of-concept exploits as well as real world exploits, I've learned that some incredibly subtle and seemingly insignificant things can be exploited. As a general rule, don't worry about exploitability, it's usually not worth your time to prove one way or another. Just fix the code whenever you see something that could go wrong. Make it easy for the next guy who audits your code; so he too can tell that your code is secure, just by seeing the right checks in the right place.
It's a good read for anyone doing C these days, and covers a good deal of problems, including the one discussed here. It comes with snippets of known vulnerable real world code too.
I forked sds.c from Redis given that it was very useful for the project for years, so I guess it may be useful in other contexts as well. There are a few changes at API level compared to the Redis sds.c file, however most of the work went into documenting it.
Since the entire idea is that the pointer on the left-hand side is to the type whose size should be subtracted, I think it's better not to repeat the type but to "lock it" to the pointer instead. This also (of course) means we can drop the parentheses with sizeof, since those are only needed when its argument is a type name.
#include <stddef.h>
size_t buf_offset = offsetof(struct sdshdr, buf);
struct sdshdr *sh = (struct sdshdr *)(s - buf_offset);
This is because the compiler might insert padding between your struct elements and the flexible array member. In your case, you're using only int types in the header, so padding shouldn't be an issue on most architectures, but consider the following:
struct header {
int i;
char c;
char data[];
};
sizeof(struct header) on my machine is 8, but "data" starts at an offset of 5 from the beginning of the struct. So, to go from "data" pointer to the pointer representing the beginning of the struct, you will need to subtract 5, not 8. Here is a test program:
That surprised me. I thought (and read somewhere) that the flexible array member comes right after the entire struct, which includes possible padding.
OP got away with it since he always allocates the size of the entire struct, which is 8, plus the string size. And since char doesn't have any alignment requirements, it doesn't matter: he always gets back to the correct offset. So the data member is basically unused, and if you check the code you will see that it is in fact never used!
The standard actually specifies (or specified?) that the flexible array member must come after all the members, including the padding. But, from my reading, that wasn't their intent, and is certainly not how compilers implement it. The link in my original post is an acknowledgement from the committee about this and that the standard needs to be updated.
I think the main reason the OP got away with it is because in his structure, "sizeof(struct sdshdr)" is equal to the offset of "buf" in the struct. This is not necessarily true.
Very true of course. I guess I assumed that antirez had thought of that; I should have noticed that there were no packing attributes on the struct. I love offsetof() for code like this.
Heh, you're the first person I meet who advocates dropping the parens around the sizeof argument in any situation. That does look terribly alien to me.
Really? I always do. If you don't, it makes the variable look like a type. It's the same reason why I dislike extra parens around the value in the return statement. Don't make it look like something it isn't.
This is neat, but please add a pointer to a deallocation function to the sds header and use it in sdsfree. This will allow returning sds strings from functions in shared libraries. A common problem is that the user of the library might use a different version of libc than the shared library (this is especially typical on Windows), and when it calls sdsfree on an sds string allocated with a different libc, something awful will happen (in the best case, it will just leak memory). By storing a pointer to the deallocation function in the code which allocated the string, you can make sure that it is always released by the same libc version that allocated it.
The biggest missing thing in C imho is destructors, because you need to manually clean up everything whenever you leave a scope or function. That means calling "free" at every exit point.
Or alternatively, putting one "cleanup:" label at the end of the function and using "goto cleanup;" instead of break or return everywhere.
But goto is considered harmful, so that handy pattern seems unclean.
My question is: do you consider the "goto cleanup" pattern clean or not, and if not, what are better alternatives?
Many people are correctly citing the usage of goto in the Linux kernel. One thing I would like to add is what Greg Kroah-Hartman says about goto in one of his talks: only jump forward, never backwards. This is precisely what you do when you goto cleanup.
I, personally, think the pattern is clean. A labelled break is basically a goto anyway, yet not always available in your language. I never liked needing a flag to exit some nested structure early. I don't find reading that clean at all.
Also, a minor rant here: Dijkstra's "Go To Statement Considered Harmful" is often mentioned, sometime wordlessly, when goto is brought up, but it seems many of those people misunderstood what was actually being argued. Although he mentions being "convinced that the go to statement should be abolished from all 'higher level' programming languages," his main gripe was that the "unbridled use of the go to statement has an immediate consequence that it becomes terribly hard to find a meaningful set of coordinates in which to describe the process progress." He felt it was "as it stands is just too primitive; it is too much an invitation to make a mess of one's program." So, as others have pointed out already, his main issue was with goto being used in a way that resulted in unstructured, hard to understand programs.
I had points taken off from a C programming assignment in college once because I used goto for cleanup, and the grader blindly took off points for using goto. It took me quite a few back & forths with him to convince him that my usage was fine. His words were: "the use of goto is rather dangerous and usually leads to spaghetti code and reduction in performance (because of prediction + cache reasons), wherever you used goto you could've used functions".
I have heard about this happening to others. In the kernel I developed as part of an operating systems course, I pre-emptively commented my use of goto with a short justification of its usage and cited its use in the linux kernel. I didn't get any response, so I have no idea how the T.A. felt about it. I didn't lose any points, at least.
Agreed. Gotos are not bad at all, particularly when dealing with error scenarios. They avoid hacky if-then error logic (sort of like try/catch/finally).
If you are constructing arguments in order, you can destroy them in the reverse order at the exit and this lets you cleanly handle a failure in the middle of allocations:
Notice the cascade after the 'out' label; each label after it is for a different level of cleanup needed, depending on how deep into the system call the function encountered the error. Also notice that this is basically the kind of code one would generate for exceptions.
In performance critical nested loops, goto is the only way to exit early. A typical example here is performing a matrix multiply. In cases like this, it is absolutely necessary, because C doesn't provide a more granular 'break'.
You could leave nested loops with a condition flag, which could produce something semantically equivalent to the goto. It seems like the logic here might be simple enough that the actual check could be eliminated in the generated code by a Sufficiently Smart Compiler (that could actually exist), if there was sufficient demand for it. I'm not sure it's actually more readable, though.
I'd argue than a condition flag in this case would introduce code bloat (declare, initialize, check in the loop construct) and decrease readability compared to a goto Label.
I agree that it typically would decrease readability, and thus wouldn't be advisable. My point was that readability is what we should be arguing, for the few places Go To is in fact appropriate, and not that we need it - even with performance constraints we basically don't.
Yeah, of course I'm aware, it's just that there is a time/space trade-off, so I think folks who argue that goto should never be used or has no legitimate uses, are just taking too much of a hard-line.
And I think it just highlights a lack of understanding about what might make goto "bad" in programming--IMHO it's "bad" when it makes control flow more complex, which undermines maintainability. Common "cleanup" idioms that only jump forward are not that bad, because they don't make control flow particularly more difficult to follow. (Whereas jumping backwards, especially across many state changes, can be very hard to follow.)
I don't see a time-space tradeoff here. Compiled naively, I expect the version with flags to be slower and take more space. Compiled sufficiently smartly, I expect them to be equivalent. I expect real compilers to be close enough to the latter for most but not all purposes, but I think if the flag version was more readable and maintainable the place to focus (collectively, medium term) would be on making compilers smart enough. Obviously, in the short term on specific projects you do what you need to.
"I think folks who argue that goto should never be used or has no legitimate uses, are just taking too much of a hard-line."
I agree, but that's because there are places where use of Go To makes things more readable and maintainable, not - primarily - because they are actually needed, per se, even with performance constraints (in the overwhelming majority of cases).
'And I think it just highlights a lack of understanding about what might make goto "bad" in programming--IMHO it's "bad" when it makes control flow more complex, which undermines maintainability.'
Yea, I agree. I love C and was a C purist and flat-out refused to learn C++ well and use it in practice. This recently changed and now I quite enjoy C++. Destructors are one reason why. The whole idea of resource-managing classes and RAII is clever and really useful.
The hard part is that it takes a lot to be a good C++ programmer. C is such a smaller language, you can master a handful of best practices, gain experience, and become a competent C programmer. In contrast, C++ is such a large language you have to be active in learning everything — read Meyers, Modern C++, GoF's Design Patterns, etc. Anything short of this, and you're going to accidentally reinvent the wheel or do something stupid. But still, C++(11) is a terrific language that I'm learning to love.
Dijkstra never considered the kind of goto you describe as harmful. This is a distinction that's lost on most programmers now that the kind of languages that did have harmful gotos are dead and forgotten.
What he argued against was gotos that jump across procedure or scope boundaries.
I think part of the problem is that everyone has heard that goto is harmful, but no one ever reads the original letter, even though it reads like a short blog post (assuming they'd had blogs back then).
According to Dijkstra, the really evil gotos are the ones that force you to keep track of the whole execution path in order to figure out the current state of the program. It's much easier when you can look at a line of code and statically know what the program state will be like at that point.
A for loop is better than goto because you can look at a line inside it and instantly know that it will run a number of times, with the index changing in increasing order.
gotos for resource cleanup are good because you can look at a given line of code and know what resources still need to be freed.
Code that gets rid of gotos and break statements by blindly replacing them with tons of flags is just as bad as gotos because you still need to look at the whole execution path to figure out the state in the flags.
Relevant quote from "GO TO statement considered harmful": "The remark about the undesirability of the go to statement is far from new. I remember having read the explicit recommendation to restrict the use of the go to statement to alarm exits, but I have not been able to trace it; presumably, it has been made by C. A. R. Hoare."
He doesn't explicitly say it, but it seems he agreed with that use of go to.
In fact, his original title was "a case against the goto statement". It was the editor of CACM that gave it the link-baity (for the time) title "considered harmful".
GOTO _was_ considered harmful back in 1968(!) when structured programming was still in its infancy. Code of that time was usually peppered with GOTOs and therefore a pain to read.
Using GOTO for cleanup tasks is perfectly fine because it increases readability.
If you read much of the original discussion around the "Go To Statement Considered Harmful", this pattern is frequently pointed out as a reasonable use case absent proper exception handling.
Indeed, from the original editorial: "I remember having read the explicit recommendation to restrict the use of the go to statement to alarm exits, but I have not been able to trace it[.]"
It is the right way to handle exceptional conditions in modern C.
Well here is one alternative: allocate temporary strings on a stack basis. Have some higher level functions reset the stack pointer. So now things are allocated temporarily by default, and only in the (hopefully) rare case where you need a long-lived item do you promote it to long lived. Promoting could mean copying the string to the heap, or it could just mean marking it so that the stack cleanup mechanism skips it (so perhaps you have a linked list of temporary items or something similar).
Well I did not mean that the strings are literally on the call stack. You can make your own stack (and it doesn't have to be a stack, just have the same semantics).
You need to free heap-allocated objects, but stack allocated objects are important and useful (particularly with RAII) and cleanly reversing such allocations eliminates a class of bugs.
... mostly. Much like in a garbage collected language, you need to make sure you are no longer referencing things when you want resources to be freed, but it does make the worst case much harder to run into.
Recently I've been getting into some C and having programmed in a number of other languages know enough to be sure I wanted to have a solid understanding of properly handling encodings and strings in general. I've obtained a good understanding, but now I am looking to abstract some of the low-level mechanics. Unfortunately most of the libraries I have seen (such as bstring) use structs. It was nice to see this approach taken. Thanks for extracting it and making it easy to find and learn about!
On a related note - does anyone have recommendations for a similar small library for dealing with conversions from char * to wchar_t * and basic encoding duties? I'm working cross-platform, and so far I've stitched together some functions wrapping stuff like wcsrtombs() and WideCharToMultiByte().
Excellent news - the Redis source code is a really good source of high quality dependency-free C code and tends to be the first place I look.
Some time ago I forked the SDS code and added a set of additional utility functions around this for a project I was working on at the time (basic file reading, regex, LZF compression, Blowfish encrypt/decrypt, SHA256 etc).
This was from a fairly old SDS version so this looks like a good opportunity to sync up with the library version.
Disadvantage #1 can be simply solved by using a pointer to the "sds" type instead.
Then you can be sure that
sdscat(&s, "Some more data")
updates s to always point at the right memory address, and you can't introduce hard-to-find bugs by forgetting to assign to s, which the compiler won't warn you about.
If you'd pass just "s" instead of "&s" as the first parameter, the compiler would error out.
So all functions modifying the string should take a pointer to it.
My biggest gripe with C's string functions is that it's really easy to make off-by-one errors when trying to do anything with two strings, due to NUL and the way they seem to handle it inconsistently. I'm constantly having to check the docs for any given string function to find out how it handles NUL (which isn't always in an obvious spot thanks to the design of man pages). And once I've found it, I have to come up with some contrived example that usually needs to be written in a comment just to make sure I've used its intended algorithm correctly and didn't cause an off-by-one error. If this library hides all that for me, I'm sold.
Sigh, another string library. And this is the reason why I prefer C++ over C. You don't end up writing another string library for the 300th time. There aren't that many differences between string libraries anyway. Most of them are just structs with a pointer to a memory block, plus a length field.
> Sigh, another string library. And this is the reason why I prefer C++ over C. You don't end up writing another string library for the 300th time.
You say that, and yet every C++ project I've ever touched in my life has had its own string class with various levels of horror attached. My favorite was the one that stored everything internally as 32-bit characters to be Unicode safe, and was never used in a codebase that had to deal with Unicode.
Yep, but this one doesn't have the usual struct+pointer layout (that's the whole point), and I'm not sure it qualifies as "yet another", since it was written in 2006.
It appears to me to be "yet another" because the length before the string approach is usually referred to as a b-string.
Windows (for example) has had a comprehensive b-string library (the type is called BSTR) for about 20 years; due to its age and provenance it has the downside of thinking a character is a 16-bit value...
You're right, sorry. I only scanned the document, saw the struct, and closed the tab. The way you use a header before the data is similar to how Ruby represents its strings internally.
> And this is the reason why I prefer C++ over C. You don't end up writing another string library for the 300th time.
Ironically, in almost all of my uses of C++, I did end up writing or using a custom string library there too. (I was doing mostly console games or language interpreters.)
I'm not so happy with C++ string either. Main complaint is that I like the use of C strings as "semi-predicates"- you can test for NULL to indicate a failure. I wrote my own String class at one point to provide this feature:
// These provide tests for assigned/not assigned
inline operator void*() const
{
return (void *)s;
}
inline bool operator!() const
{
return s == 0;
}
Eh, I like my C idioms. With C++ it's easy to make it safe so that accessing an unassigned string doesn't cause a crash (make NULL equivalent to empty strings except for tests).
I have the same gripe with the way the STL is designed. Too tedious to test for empty first before reading an item.
Making sds a typedef for char* is very convenient. But it makes it very easy to pass an sds to a function that expects a C string without checking for null bytes.
Ruby, Java, Perl, PHP have all had security problems when interacting with C because they failed to properly distinguish binary-safe strings and C strings.
Disadvantage #1 seems unnecessary. Instead of this:
s = sdscat(s,"Some more data");
Why not do this?
sdscat(&s,"Some more data");
The latter would make the use-after-free error they're describing impossible. (Disadvantage #2, changing one reference but not others, would remain. And callers would still need to check for NULL if they intend to handle ENOMEM gracefully.)
I assert there's no meaningful performance difference between the two.
This is a common practice in relatively new interfaces like pthread_create.
> And you still want to access the old value if reallocation failed.
If you want to provide commit-or-rollback semantics, you could signal error via return value rather than by replacing s with NULL:
if (sdscat(&s,"Some more data") != SDS_SUCCESS) {
/* failure path; s is unchanged */
} else {
/* s now has "Some more data" in it */
}
but this may be completely useless, depending on the environment(s) in which the library or program is intended to be used. On 64-bit Linux systems with memory overcommit enabled (the default), this failure path essentially shouldn't ever happen. Instead, some process (maybe yours, maybe not) will be picked by the kernel OOM killer. Many programs just use an allocate-or-die interface, as the sds README mentions.
You're right, I wasn't saying that it's the only way to write allocation/assignment.
I was saying that in that case sdscat() matches with sdsnew() style.
The CString C++ class from Microsoft's MFC (and now also ATL) has used, since forever, the trick of keeping both the length and the reference count in a single allocation together with the characters themselves, as well as an additional 0-termination character even when it was initialized from a non-zero-terminated buffer.
In the case of disadvantage #1: s = sdscat(s,"Some more data");
You could fix that by adding a "remote pointer" header field. Inside of sdscat, you would allocate a new sds struct and set the remote pointer header field in `s` to the new sds struct's location. You could also try to do a realloc, and maybe you'll get the same starting pointer back again.
How much better is this dynamic approach than something like immutable strings? (not that char* is immutable)
Since with the sds library, you could potentially be getting back a new pointer for each operation, you could just as easily be working from an immutable string library. Do immutable strings have poor performance? I've never really considered it.
I strongly disagree. It is a tradeoff between that and advantages #1 and #3; with the bstring library, for example, you pass in a mutable string and the function mutates it.
In no world is passing in a mutable value, having the function mutate it, but then still having to reassign your variable, superior.
Passing in an immutable value, and then assigning a fresh value is reasonable, but that's not what SDS is doing, AFAICT.
I think GCC has an attribute (warn_unused_result) you can use to tell it to warn whenever a function's return value is ignored.
I'm curious why
s = sdscat(s,"Some more data");
didn't end up like:
sdscat(&s,"Some more data");
It's basically a malloc()/free() implementation with printf, some formatting, and strcat bolted onto it. (Strictly speaking, it may or may not use free lists or whatever, but the use of the header and returned pointer is quite similar.)