Show HN: C library of generic, reference-counted data structures

comex · on June 15, 2013

A small observation: CoreFoundation typedefs CFTypeRef (the generic object type) to void * , rather than having obj * be a separate type as in this library. This lets you pass pointers to any CF type to functions that expect CFTypeRe without the ugly cast: it's unsafe, as you can pass a pointer to any type without the compiler complaining, but so is explicitly casting. If you're not going to go the void * route, it would be nice to provide some macro like obj(x) to safely cast only valid types to object.

Anyway, it's nice to see a portable, modern CF-like library, although of course it's too slow for some cases where code specialization is called for.

Edit: On the general subject of C data structures, in case anyone's interested, I wrote a little portable C header that provides minimal hash table functionality using macros, with a large amount of control - still generic, but totally opposite to the approach used by this library, much lower level. It's modeled after the BSD <sys/queue.h> header, provides three versions (open, closed, and a higher level version), and is completely undocumented, but I'm somewhat happy with it as a proof of concept. Just throwing it out there. https://github.com/comex/cbit/blob/master/hash.h

jzwinck · on June 15, 2013

I was interested to know how this compares with existing C libraries, so I "ported" the prime number example program to use GMP and GLib instead. It's not quite working yet (see the description) and I have to run now, but here's the code--I think it's nearly complete and would be interesting to benchmark once done. https://gist.github.com/jzwinck/5787359

_cxbj · on June 15, 2013

Shoot, if I had known someone was going to do this I would have made the prime number program a bit more efficient. Since it looks like you've ported the algorithm exactly it shouldn't affect the result though. Let me know what the test reveals, I tried to write the fasted arbitrary precision algorithms I could.

andrewcooke · on June 15, 2013

nice. it's heartwarming to see things like this in c. in my spare time i am working on struct-sql mapping. it sometimes feels like the popular c libraries don't try hard enough to provide the convenience we get in other languages. with a little imagination (as here) i think we can push things a little further...

paisawalla · on June 15, 2013

Is your project hosted somewhere? I would love to see the source for that, no matter how early/unpolished it is.

andrewcooke · on June 15, 2013

sure, there's an example at https://bitbucket.org/isti/c-orm/src/4ec76f741d756bf70c2b1b8... but i don't have any text explaining it yet, and it's a bit hard to see what is happening.

but basically there's a python script that parses your struct definitions and then auto-generates a library (this isn't great i know, but what else can you do? if you don't want to use that, you can still use it as an SQL library, but you need to write your own callback functions to do the work of setting values in the struct). that library is

    #include "phonebook.corm.h"

in the linked code. then you can do things like

    static int find_name(isti_db *db, const char *text, name** name) {
      corm_name_select *select = NULL;
      *name = NULL;
      STATUS;
      CHECK(cname.select(&select, db));
      CHECK(select->name(select, "like", text)->_go_one(select, name));
      EXIT;
      if (status == ISTI_ERR_NO_RESULT) status = ISTI_OK; // see NULL name
      if (select) status = select->_free(select, status);
      RETURN;

}

(the STATUS, CHECK, EXIT and RETURN are just macros for the usual "return an int as status and goto exit on error" handling). the snippet above populates the name (a struct typedef) pointer with values (id and name value in this case) from the database. the thing called select is a struct with function pointers that generates SQL select functions (so there's the usual objects-in-C pattern of passing the struct in to the function in "select->name(select, ..." for example). and the SQL being generated is "select * from [table] where name like [value]". it's all escaped correctly to avoid SQL injection.

the API for the database backend is isolated out, but the only current implementation is sqlite.

the complete repo is at https://bitbucket.org/isti/c-orm/src (if you download it and run doxygen you should see better docs - but they're incomplete and not online yet)

if people are interested, email me at [email protected] and i'll get back to you when it's in a usable state (it mainly needs docs, polishing, and perhaps another database backend - probably postgres).

also, of course, there are many limitations compared to ORM systems. there's no way to retrieve related objects, for example (so if a struct has a pointer to another struct, that pointer isn't retrieved - the best you can do is also store a pointer to the primary key and then make a second call based on that). and currently table and column names must exactly match struct and field names.

canadev · on June 15, 2013

What is struct? Or do you mean the C struct type <-> SQL? Like an ORM for C?

andrewcooke · on June 15, 2013

yup, exactly - see link above.

obiterdictum · on June 15, 2013

Along the same lines, there is also CCAN.

http://ccodearchive.net/list.html

https://github.com/rustyrussell/ccan

And GMP for arbitrary precision math (integers and floating point numbers).

http://gmplib.org/

huhtenberg · on June 15, 2013

I find that the decision to adopt a C library for a project frequently boils down to whether it is using a compatible naming notation. Especially for simpler libraries.

Linked library in nice, but how many C projects do you know that use camelCase() function naming as opposed to K&R's lower_case_naming()? If it were a complex library, like OpenSSL, then - sure, to the hell with the notation, just put a wrapper around it and use it anyway. It's barely an issue. But if it is a simpler library that is meant to be weaved into the code, like data containers, the choice of naming notation is always a thing to consider.

There is obviously an indent tool, but resorting to it means using a modified version of the original with all the consequences that follow. Perhaps, it might be the next thing for GitHub to tackle - "Download in Xyz naming notation"... I know I'd use it.

mappu · on June 15, 2013

Was there a rationale to choosing performActionOnStruct() names instead of the perhaps-more-idiomatic struct_perform_action() style?

I strongly advise against putting identifiers like release() and getCString() in the global namespace, that's probably not the wisest idea if you plan to use libraries other than your own.

_cxbj · on June 15, 2013

Some bad habits from Java are the only reason. I thought about having obrelease and obgetCString. Putting ob at the beginning of every function seems like overkill, and I don't like that only some functions had would have ob at the beginning if I didn't do a global name change. Any suggestions?

mappu · on June 16, 2013

> Putting ob at the beginning of every function seems like overkill

That's "C-style namespaces" - really the only sane way to avoid an identifier collision, sorry.

McUsr · on June 15, 2013

Hello. May I suggest looking for some code by tptacek for the red-black trees. I am looking too, but so far , I haven't been able to find it. I read about it first in this link: <https://news.ycombinator.com/item?id=4455225>.

I think your project looks very interesting. Reference counted datastructures is a great convenience.

Thanks, and Good Luck further.

canadev · on June 15, 2013

I've been meaning to refresh and expand both my algorithms knowledge and my C programming with a project like this, myself (as an application developer I find I've barely used either since graduating from university). Was that part of your goal at all?

I think this is pretty cool. I like the presence of the tests as well.

_cxbj · on June 15, 2013

I just graduated, so most of the data structures are still pretty fresh in my mind. I started after getting tired of implementing data structures in my many C based classes, hoping that at least I would get some use out of them during school. Sadly after that I never took another C based class, but the project became interesting in its own right and I stuck with it.

icodestuff · on June 16, 2013

Definitely forking this to provide a saner API. And some of the implementation is a little iffy (like objIsOfClass). But in general, very cool. Thanks for contributing to the C community!

Any particular reason why you're avoiding the C99 _Bool?

rdtsc · on June 15, 2013

I personally like and use uthash (no reference counting) but:

* only a header file * has hash, array, string, list, implementations

https://github.com/ned14/uthash

simgidacav · on June 15, 2013

This recalls me one of my oldest project: http://libdacav.sourceforge.net/

However I didn't do reference counting. Good job OP!

songgao · on June 15, 2013

star-able url: https://github.com/theck01/offbrand_lib/

pepijndevos · on June 15, 2013

What are "class[es] built from functions and incomplete types"?

I once built classes using structs and function pointers, but this sounds different?

_cxbj · on June 15, 2013

An incomplete type is a struct where the definition of the structure is hidden from the view of most of your source code. This makes it impossible to directly access struct elements, and is like data hiding in a object oriented language. Of course in this library you can always include the header containing the struct definition and go bananas so its a soft restriction.

enqk · on June 15, 2013

I would suggest offering the ability to plug custom allocators.

ambrop7 · on June 15, 2013

If you feel a need to do that, check cmccabe's comment on the top about intrusive data structures. These let you do your own memory management, in a way that the data structure implementation doesn't need to know about.

talloaktrees · on June 15, 2013

Glad to see this.

cmccabe · on June 15, 2013

It's nice that you made these available. You might also want to look at queue.h from BSD (see http://fxr.watson.org/fxr/source/sys/queue.h) and tree.h (http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/sys/tree.h), single-file "libraries" that can generate a few different kinds of linked list and binary tree. These don't require typecasting at all, since they generate functions for your particular type. They are also intrusive data structures, meaning you control your own allocation.

With regard to OBString: don't. Really. C strings are very flexible and powerful with just what's in the standard library. You don't need charAtStringIndex, for example: you have rindex and index. You don't need splitString; you have strtok_r. You don't need findSubstring; you have strstr.

If you think you need regexes, consider fnmatch, a libc function which matches shell wildcard patterns. It's always there, it's simple, and it's fast. Or you could bring in something like PCRE. In general, if you need full regexes in a small program, it might be worth rethinking whether your program should be in C anyway.

For some reason, I don't find myself using resizable array utility code much in C. It's pretty easy to do it yourself with realloc-- that's really one of the first things they should teach people learning practical C. Perhaps a small, minimal set of macros could wrap the reallocs and make it a little bit nicer.

ambrop7 · on June 15, 2013

Intrusive data structures are indeed more powerful. I made my own instrusive AVL-tree which can be found here [1] and an example here [2]. There's also the extra feature that the concept of a "link" is abstracted, so you can for example build a compressed AVL-tree inside an array using array indices instead of pointers, which don't break when the array is reallocated. It's also built in a different way than this usually done (macros), that is, a header file is included which redefines some identifiers, and the actual code of the functions is not inside macros, hence, easier to read.

About strings, I firmly believe that C's zero-terminated strings should be avoided and only used when necessary to communicate with existing code. They're really not simple and fast. You can't take a null-terminated string and extract a null-terminated sub-string without modifying the original - which is very often needed during various kinds of parsing. Also, null-terminated strings are unable to represent the zero byte, so in general, for example, you can't use them to store the contents of an arbitrary file, and if you do use them for purposes where null bytes can appear, you have to be careful and handle errors where nulls would implicitly truncate your string.

Just use (pointer, length) strings instead. It'll save you a whole bunch of trouble you may not see coming.

[1] https://code.google.com/p/badvpn/source/browse/#svn%2Ftrunk%... (CAvl_)

[2] https://code.google.com/p/badvpn/source/browse/#svn%2Ftrunk%... (cavl_test_)

mtdewcmu · on June 17, 2013

I don't see anything inherently broken about null-terminated strings. It means that the smallest string that can be represented is a single byte, and the largest string that can be represented is unlimited. By mixing in an integer, you're expanding the smallest string and placing an arbitrary limit on the largest string. If the integer and string data are mixed in a struct, then strings pack much less tightly, not just because of the space taken up by the integer, but also because it needs to be aligned. Most of the things you'd want to do with a string are O(n) anyway, like print to the terminal. Null-terminated strings aren't perfect for every use, but no representation is, and C doesn't prevent you from using a different representation when it's more appropriate. If you're writing a parser, then that's a more demanding application than usual, and you ought not feel locked into using the default representation.

mtdewcmu · on June 15, 2013

"For some reason, I don't find myself using resizable array utility code much in C."

Most of the time, the amount of memory you'll need to do something is predictable. When dynamic arrays are always at hand, you lose the habit and end up always reaching for one.

wrl · on June 15, 2013

Should you ever need said macros, I've written a set: https://github.com/wrl/wwrl/blob/master/vector.h

cmccabe · on June 17, 2013

Thanks, that's interesting. I would probably add some kind of callback to free individual elements if I ended up using this. A lot of the boilerplate when managing resizable arrays in C ends up being making sure you properly dispose of elements. Or at least it was for me when I last did this.

wrl · on June 23, 2013

Absolutely, but that's something I feel that could be implemented on top of these macros if necessary.

Currently I'm using this code just for a string buffers and other arrays where the individual elements are stored in-place in the array.

_cxbj · on June 15, 2013

Sadly some form of OBString is a necessity, any string object needs a reference count and some function pointers packaged in the struct to work. Converting OBString to have the same interface as standard string handling functions is a good idea though, and using the standard functions as a backend would also be done in that case.

ExpiredLink · on June 15, 2013

> You might also want to look at queue.h from BSD

CPP macros, entirely. No, thanks.

pmjordan · on June 15, 2013

I'm not a fan of macros either, which is why I've ended up implementing hybrid C data structures, with the logic in functions, and some optional type helpers in macros:

https://github.com/pmj/genccont

Anyway, these are largely intrusive and don't do reference counting, but that can be seen as an advantage or disadvantage, depending on the situation. (I use these heavily in kernel code) The hash tables (chaining and open addressed linear probing) use function pointers to be type-generic, which might not be to everyone's taste either.

cmccabe · on June 18, 2013

Well, this is a good effort. I think you are quite right to leave out reference counting. Reference counting is something that you should pay for only if you need it.

I don't really understand the fear of macros. void pointers are much scarier than macros, since they open you up to the threat of bugs going undetected by the type system.

You're not by any chance maintaining one of those much-maligned "driver portability" layers, are you?

pmjordan · on June 20, 2013

For better debugging and diagnostics, I like my macros to do as little as possible. Plus, if the generated code is the same for all types apart from an offset calculation, duplicating it is silly.

If you look at the slist, slist_queue and dlist, you'll find there are typed macros (via a container-of macro) that call down to the generic functions. I might try to extend this to macros that generate a full set of shim inline function for a particular type of list element, if it turns out to be less error prone. You won't see a void pointer being passed around on any of these.

Where function pointers come into play (binary tree, hash tables) I'll admit it's maybe not quite so clear cut as the indirection messes with branch predictors, etc. and the callbacks all take void pointer arguments. Maybe for those containers it would be better to have generator macros that produce function prototypes and implementations.

And no, I'm not doing anything particularly focused on driver portability. I mostly write OS X kexts, and the built-in data structures in xnu aren't terribly helpful. (the OSContainer family uses blocking memory allocations, can only hold refcounted objects and the map and set use linear search) Some kernel code I write is designed to be portable, and I use these data structures for that code, but it's pretty much just a standard factored design - general core and platform specific setup and API use. Nothing with any pretence of being a general solution.

nitrogen · on June 15, 2013

Just as an anecdote, in one of my university classes we had the assignment of writing and optimizing a memory allocator. When I switched from using structs to using pointer arithmetic in macros, I gained a significant performance boost, even though the in-memory data was exactly the same.

zht · on June 16, 2013

were you compiling with optimizations on?

nitrogen · on June 16, 2013

Yes, gcc -O2 on Opteron I believe (it was a while ago).