I found these the other day and I wonder how they have largely slipped under the radar. Of particular interest are hxpipe and hxunpipe, which make "scraping" tasks absurdly easy by converting HTML to a form easily manipulated by sed, grep, and other fun Unix utilities.
update:
tracking the score of this post on the front page using this:
You know what makes me sad now? That there doesn't seem to be anything like these for CSS files: in particular, for extracting references to external files and images, or for moving a CSS file from one directory to another while maintaining relative links.
hxpipe (1) - convert XML to a format easier to parse with Perl or AWK
Being unfamiliar with either Perl or AWK, could anyone point me to an explanation or example of why this format is easier to parse, and what the format looks like? Would it be easy to write a similar utility to, say, convert it to a Lua table?
The idea is that those utilities work in the UNIX way, which means that they are line-oriented.
The following two XML documents are equivalent:
<a><b><c /></b><d>foo</d></a>
and
<a>
<b> <c /> </b>
<d>foo</d>
</a>
But recognizing that equivalence with classical UNIX tools, which are line-oriented, is quite difficult, so you'll have a hard time doing operations such as "replace 'foo' with 'bar' if it appears as the text node of a 'd' element".
So the idea of hxpipe is that it is supposed to give you a line-oriented representation that is the same for both documents, which you can then work with.
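To make that concrete: once the document is one event per line, the "replace 'foo' in a 'd' text node" rule becomes a plain sed address range. A minimal sketch, using printf to stand in for hxpipe's actual output (the format is shown below; the input document is made up):

```shell
# Simulated hxpipe-style output for <a><d>foo</d></a>:
# '(d' .. ')d' delimit the element, '-foo' is its text node.
printf '(a\n(d\n-foo\n)d\n)a\n' |
  sed '/^(d$/,/^)d$/ s/^-foo$/-bar/'
# prints the same stream with '-foo' replaced by '-bar'
```

The address range `/^(d$/,/^)d$/` restricts the substitution to lines between the opening and closing markers of the `d` element, which is exactly the scoping that is painful to express against the raw one-line XML.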
But it actually fails to do that properly (at least for my taste). I largely prefer the output of xml2. Compare:
# first doc, output of hxpipe
(a
(b
|c
)b
(d
-foo
)d
)a
-\n
# second doc, output of hxpipe
(a
-\n
(b
-
|c
-
)b
-\n
(d
-foo
)d
-\n
)a
-\n
# output of xml2, for both documents
/a/b/c
/a/d=foo
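On the xml2 representation the same edit collapses to a single substitution, with no open/close pairs to track, since each node carries its full path. A sketch, again with printf standing in for xml2's output shown above:

```shell
# Simulated xml2 output (identical for both documents above);
# rewrite the value of the d node in one pass.
printf '/a/b/c\n/a/d=foo\n' |
  sed 's|^/a/d=foo$|/a/d=bar|'
# prints:
# /a/b/c
# /a/d=bar
```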
This builds with gcc 4.8.2 (and earlier), since the gcc stdlib provides definitions for min/max. But with clang you need to supply a MIN function or macro appropriate to your system (hxindex.c does not link without it).
Specifically on Mac OS X you need to modify the hxindex.c like this:
--- hxindex.c 2013-07-25 17:22:53.000000000 -0400
+++ hxindex.c.patched 2014-03-11 10:05:55.000000000 -0400
@@ -43,6 +43,7 @@
* Version: $Id: hxindex.c,v 1.20 2013-07-25 21:04:05 bbos Exp $
*
**/
+#include <sys/param.h>
#include "config.h"
#include <assert.h>
#include <locale.h>
@@ -439,7 +440,7 @@
/* Count how many subterms are equal to the previous entry */
i = 0;
- while (i < min(term->nrkeys, globalprevious->nrkeys) &&
+ while (i < MIN(term->nrkeys, globalprevious->nrkeys) &&
!folding_cmp(term->sortkeys + i, 1, globalprevious->sortkeys + i, 1))
i++;
Basically, you need the sys/param.h include and to change the min() calls to MIN().
AFAIK the MIN macro is not standard (it comes from BSD's sys/param.h, not from ISO C), so I made this:
--- hxindex.c.orig 2014-03-11 17:55:17.305697689 +0200
+++ hxindex.c 2014-03-11 17:58:30.331318646 +0200
@@ -103,6 +103,10 @@
#define SECNO "secno" /* Class of elements that define section # */
#define NO_NUM "no-num" /* Class of elements without a section # */
+#ifndef MIN
+# define MIN(X, Y) ((X) < (Y) ? (X) : (Y))
+#endif
+
typedef struct _indexterm {
string url;
int importance; /* 1 (low) or 2 (high) */
@@ -435,7 +439,7 @@
/* Count how many subterms are equal to the previous entry */
i = 0;
- while (i < min(term->nrkeys, globalprevious->nrkeys) &&
+ while (i < MIN(term->nrkeys, globalprevious->nrkeys) &&
!folding_cmp(term->sortkeys + i, 1, globalprevious->sortkeys + i, 1))
i++;
Hmmm, weird, I can't compile on Mac OS (clang); it complains about undefined iofuncs in openurl. I can't figure out exactly what it's missing: it seems to be a library, but which one, and why? I could install the Homebrew version, but it seems to have a bug that makes hxselect not work correctly :/
That's also the impression I got from the sources, and I do have libcurl. I'm quite puzzled about where the problem is coming from. I just did a system upgrade (I was a minor release of Mac OS behind) plus a brew update and upgrade, and I'm still missing this minor "something." Oh well, the only thing that doesn't work correctly is hxselect... I just faked it with grep.
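For reference, the kind of grep stand-in meant here might look like the following; it only handles elements that open and close on a single line, so it is nothing like a real CSS selector engine (the example markup is made up):

```shell
# Crude substitute for `hxselect title`: grab one-line <title> elements.
printf '<html><head><title>Hello</title></head><body></body></html>\n' |
  grep -o '<title>[^<]*</title>'
# prints: <title>Hello</title>
```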
Mh, interesting, I need to check this out. Currently I'm using xml2 and 2xml and classic Unix tools (sed, grep, cut…) to deal with HTML in Bash scripts and Makefiles (this is how my personal website is regenerated automatically when I commit or push modifications, by calling `make` in the corresponding git hooks).
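The git-hook wiring described here is conventional; a minimal sketch of what such a hook file might contain (the hook name and the Makefile location at the repository root are assumptions):

```shell
#!/bin/sh
# .git/hooks/post-commit — regenerate the site after each commit.
# Assumes a Makefile at the repository root; post-receive would be
# the analogous hook on the remote side for pushes.
make -C "$(git rev-parse --show-toplevel)"
```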