The mystery of the fifteen-millisecond breakpoint instruction

userbinator · on Sept 5, 2015

What exactly is the purpose of logging all unhandled breakpoint instructions encountered in user-mode code to the syslog, and then continuing...? On Windows, a breakpoint instruction encountered outside of a debugger causes the standard error dialog to appear, and if you choose not to attach a debugger, the process terminates. IMHO that's more reasonable behaviour, as breakpoint instructions shouldn't appear in code that's not being debugged (either by a debugger or itself, as is sometimes the case like the author does here.)

because unprivileged user-space code is able to force the kernel to waste a lot of time printing messages, slowing the whole system down

User code can do that already, or even just run in an infinite loop.

dezgeg · on Sept 5, 2015

    >>because unprivileged user-space code is able to force
    >>the kernel to waste a lot of time printing messages,
    >>slowing the whole system down
    >User code can do that already[...]

I'm pretty sure that'd be considered a kernel bug. After all, there already is a a function (printk_ratelimited) which is supposed to be used in situations where userspace can trigger a kernel print.

userbinator · on Sept 5, 2015

I meant that you can already do these if you want to consume a lot of CPU time:

    for(;;) printf("Hello world!");

    for(;;) ;

seba_dos1 · on Sept 5, 2015

Consuming a lot of CPU time and forcing kernel to do that are separate things.

madez · on Sept 5, 2015

But that won't slow down the whole system, if you didn't mess with the process priorities.

lultimouomo · on Sept 6, 2015

sync() will most definitely do.

caf · on Sept 5, 2015

It doesn't continue, it delivers a SIGTRAP to the process and the default disposition of that signal (if the process isn't being debugged or has otherwise arranged to handle the signal differently) is to have the process dump core and exit.

caf · on Sept 5, 2015

I'm surprised the author didn't submit this patch back to the Linux ARM architecture maintainers - it seems entirely apropos.

raverbashing · on Sept 5, 2015

Yes, or at least a discussion on why this instruction is not supported (might be just a slip-up or have a bigger underlying reason)

makomk · on Sept 5, 2015

There's apparently an even worse usermode-triggerable denial of service in the Pi related to breakpoints, one that allows usermode code to completely hang it: http://hackaday.com/2015/08/25/raspberry-pi-halt-and-catch-w...

glandium · on Sept 5, 2015

This reminds me of one thing I hit 2 years ago, where several android mobile devices would be slow to call signal handlers, because the kernel was booted with user_debug=31(on production devices), which printk stack traces before calling signal handlers. The kernel command line argument was hardcoded on some code aurora branches for Qualcomm boards, iirc.

evmar · on Sept 5, 2015

He wrote:

> Think about the common debugging scenario where the user sets a conditional breakpoint: "break if x > y". Testing that condition is going to take 15ms each time.

Wouldn't the implementation of a software breakpoint look like:

    if x > y: bkpt

so it'd only be slow on the case where it needed to break? I guess I can imagine you could implement it the other way (always break, check the condition after breaking, resume if the condition isn't met) but it's not obvious to me why you'd do it that way.

Just guessing, but if you're just patching the binary to insert the breakpoint you might not have enough space? But you need space for the "bkpt" instruction too...

moyix · on Sept 5, 2015

Typically the condition testing for a conditional is done inside the debugger, not inside the patched code. The reason for this is that the conditions can be quite complex, and require debug symbol resolution. And this takes up a lot of space.

The issue of space is actually a bit subtle. Imagine you have something like (assuming 4 byte fixed width instructions, which is the case on ARM):

   0000 instruction 1
   0004 instruction 2
   0008 instruction 3
   000c jmp 0004

Breakpoint instructions are generally designed so that they only take up one instruction's worth of space. On x86, the opcode is just "0xCC". Once you start adding in more complex code, you will necessarily use more instructions. Now, suppose your new breakpoint takes up two instructions (bytes 0-8). When the jump at 000c takes place, it will jump into the second of your two instructions, which will probably not do anything good (on x86 this is even worse because you may jump into the middle of an instruction!)

There's also the practical issue – if you put the conditional test in the patched code, you would need to basically bundle a compiler with gdb so that it could compile the conditional testing code in order to create the patch.

anon4 · on Sept 6, 2015

The very latest versions of gdb actually can use gcc to compile code and allow you to hotpatch code into the executable. https://sourceware.org/gdb/current/onlinedocs/gdb/Compiling-... so I'm guessing maybe in the future this can be used for faster conditional breakpoints

derf_ · on Sept 5, 2015

I don't know what gdb does, but even on x86 a conditional breakpoint can be ridiculously slow. On a loop that executes a few hundred thousand times before meeting the condition, it can take minutes to evaluate something that executed instantly before.

I often go back, edit the code to add something like

  if (condition) {
     foo = foo;
  }

recompile, restart the debugger, and break on the foo = foo line. Even with all of those steps, that's still faster than using a conditional breakpoint most of the time.

spc476 · on Sept 6, 2015

Worse case scenario---the CPU only supports a breakpoint instruction. Doing a conditional breakpoint involves the debugger setting a breakpoint instruction at the next instruction, resuming, then resetting the breakpoint and setting a breakpoint at the next instruction (with special care taken for calls, jumps and conditional jumps) and doing this, for every instruction until the condition is met.

Second worst case---the CPU has both breakpoints and single stepping mode (the 8086 supports this). In this case, the debugger no longer needs to set a breakpoint on every instruction, but the CPU still endures an interrupt per instruction until the condition is met.

Best case---the CPU has a way to specify a condition in hardware (the 80386 and above can do this to varying degrees) but often you have a limited number of conditions that can be handled and if you exceed this, then you either have to disallow it, or fall back to one of the previous methods (or, perhaps depending upon the condition, you can play around with page settings to capture reads or writes but that has its own overhead).

comex · on Sept 5, 2015

Incidentally, GDB can sometimes evaluate breakpoint conditions on the target:

https://sourceware.org/gdb/onlinedocs/gdb/Conditions.html (search for "target's side")

I don't know what targets support it, though. (Pretty sure OS X isn't one of them.)

revelation · on Sept 5, 2015

For next time, Windows has the DebugBreak function which will stop the program and (if you have a debugger attached) act the same way as stopping on that line.

On Linux, I've seen some people use asm("int 3") (which is what a debugger would use for software breakpoints).

notacoward · on Sept 5, 2015

There's something extra-satisfying about debugging a debug facility. Nice detective work.

AceJohnny2 · on Sept 5, 2015

Very tangential to the topic at hand, it makes me happy to see a somewhat lucid comment on that article by Terry A Davis.

(For those who lack context, Davis is a brilliant but schizophrenic programmer who has some notoriety on HN: https://news.ycombinator.com/item?id=7818823 More info at http://motherboard.vice.com/read/gods-lonely-programmer )

TheLoneWolfling · on Sept 5, 2015

Wait... It takes ~1ms to write a line to a log file?

a1k0n · on Sept 5, 2015

My theory is that it's also being written to the RPi's serial port console, which runs at 115200 baud by default, so 11520 characters/second or ~7ms to write 80 characters.

TheLoneWolfling · on Sept 5, 2015

Good theory. Alas, the author mentioned it took ~1ms, which would be ~11.5 characters, and those log messages are a lot longer than 11.5 characters.

tonyg · on Sept 5, 2015

If I understood the article correctly, the ~1ms cost came from having rsyslog running, not from printk. Killing rsyslog led to a ~1ms speedup, but it was removing printk() that led to the rest.

Printing to the console on the RPi could involve either or both of the serial port and the software-character-generator-based text-mode video output. Depending on screen resolution, the video output might be quite slow to print to, especially if it had to scroll a line, depending on exactly how it's implemented.

TheLoneWolfling · on Sept 5, 2015

In which case, what on Earth is rsyslog doing that it takes 1ms to append a line of text to the log file?

I could see it if it was outputting something to software-based video - except that in that case, why is it trying to output to software-based video?

simoncion · on Sept 5, 2015

> ...except that in that case, why is it trying to output to software-based video?

I don't have much experience with rsyslog, but syslog-ng can be configured to log to all consoles. [0] I know that consoles are usually configured to blank themselves after a period of inactivity. I wonder if a blanked console can be written to faster than an non-blanked one. [1]

Edit: The Fine Article supports my assumption about how rsyslog is configured on the RPi. Search for "printk". Also remember that -when syslog is disabled- printks are written to the active console.

[0] The default configuration appears to log everything to /dev/tty12. I see no reason why you couldn't log everything to -I guess- /dev/console. (Or, failing that, /dev/tty[0-12].)

[1] I know that screen-scrolling updates that happen in a console that is not the current one (or that happen in a screen window that's not the current one) are substantially faster than on-screen screen-scrolling updates.

JoachimSchipper · on Sept 5, 2015

You need quite a few kernel<->userland switches in this case, maybe that's enough?

TheLoneWolfling · on Sept 5, 2015

Potentially, although in that case it's still dsyslog's fault.

ars · on Sept 6, 2015

It's classed as an ALERT level message, so rsyslog will sync after outputting it. That might account for the time.

exDM69 · on Sept 5, 2015

Since it's a debug log from the kernel, it makes sense to take the time to make sure it gets all the way to the disk. For many other purposes, it's fine to be quick and just do the best effort. But the kernel log is a vital resource for doing post mortem analysis if everything goes wrong and the kernel panics.

It's better to be slow than unreliable in this case.

TheLoneWolfling · on Sept 5, 2015

Except that it's relatively simple to make an append-only file both quick to write and reliable on flash...

It seems that both the file system and flash controller punt the relatively common (and relatively simple) case of a file being repeatedly appended to by small amounts.

anonyfox · on Sept 5, 2015

disk is slow, seems reasonable.

TheLoneWolfling · on Sept 5, 2015

Don't forget the Pi (generally) uses flash. So seek speeds are irrelevant.

If you're writing the following in 1ms:

    Apr  4 22:50:59 heating kernel: [14572.940927] Unhandled prefetch abort: breakpoint debug exception (0x002) at 0x00008438

By my count that's ~1 kilobit.

That means that you're writing to flash at ~1 Mb/s. (~125KB/s) That's slow, to put it mildly. (~5% of what it should be, according to http://ozzmaker.com/2012/12/27/how-to-test-the-sd-card-speed... ) I could see that kind of speed if it was causing a disk seek every write (although if it was doing that you're probably using the wrong file system), but again, the Pi uses flash.

So what gives? Is there braindead sync behavior that's causing it? Is it hitting a pathological case in the flash controller? Or what?

(For that matter: is there a file system that is optimized for log files?)

tjoff · on Sept 5, 2015

Just read the second comment on the link you provided and you have the answer.

TheLoneWolfling · on Sept 5, 2015

> Nowaday’s fast Class 10 cards achieve their high read and write speeds only by buffering in advance, assuming that the subsequent blocks will be read or written next. That means they inevitably sacrifice random read and write speeds. This effect is in the order of several magnitudes. For instance, a card with sequential read and write speeds of 10 MB/s might collapse to just 0.01 MB/s for random access.

Yow. Yet another example of over-optimization to benchmarks.

...Except that, even then, how exactly is appending to a log file repeatedly random access? Bad journaling?

tjoff · on Sept 5, 2015

This is much closer to random access than sequential access. Counter question: How is appending to a log file repeatedly sequential access?

No meta data or other interaction required? Surely the sequential writes benefit from larger buffers? Not to mention that the flash block is way larger than the appended message. You would need TRIM to not make that an extremely wasteful operation, not sure if the typical Pi installations make use of it but a quick google seems to indicate that it at least has been a problem. So writing that 1 kbit now means reading a whole block, edit the content of it, and writing that whole block back - which is hardly sequential either. 1 Mbit/s is starting to seem pretty fast.

The only pathological case here is the "not optimal sequential write case", which sure is enough to kill any random USB-thumbdrive or SD-card. Just copying two files two a thumdrive at once can kill performance by an order of magnitude compared to doing them one after another.

TheLoneWolfling · on Sept 5, 2015

This is where you benefit by having a file system (or disk controller) that isn't brain-dead.

Unfortunately, it seems like both the controller and file system are brain-dead on the RPi.

Namely, you have a cache that detects this sort of (relatively-common) append-only operation, and buffers it until it makes sense to actually write it (either because the disk is otherwise unoccupied or because you are close to the limits of your power buffer).

Alternatively, you have a file system that actually knows the advantages and limitations of flash. You can do append-only files efficiently on flash (without doing all of the read + write dance). "All" you need to do is take advantage of a bit of scratch space and knowing how flash works. Namely that you don't necessarily need to flash a block to write to it, as long as you are only turning zeros to ones (or vice versa, depending on how physical bits are mapped to logical bits. Multilevel flash is a little more complex, but still doable). (For instance, instead of having one filestamp, you have space for 16 filestamps, with two bytes at the beginning, all initially set to all zeros. You check the last bit set and use that filestamp. Congratulations, you now can update the file 16 times before needing to reflash the block. If that's not enough, you can extend it arbitrarily (note that you can do this dynamically for files only when they are updated enough to warrant it!) And metadata/etc can be done in the same way. For something like a log file, you have, say, 16 slots for how much of the block is occupied, and a pointer to the next block (or null). You append the content of the new log line to the file and add the new content. One write, no blocks being flashed. (Two writes if/when you run out of space in the block.)

See, for instance, https://github.com/bnahill/FLogFS/ or https://en.wikipedia.org/wiki/F2FS/

salgernon · on Sept 5, 2015

The sole purpose of the pi is not to write kernel log files. I'm sure if you want to build a piece of hardware / software kit that did nothing but write sequentially to a flash drive, that'd be blazingly fast.

But if you want log rolling, filtered output, serial output, non-flash file systems (nfs, etc) and a general purpose computing device where printk is the rarest of rare events (in the lifetime of a task) then you're going to lose some of your optimizations.

TheLoneWolfling · on Sept 5, 2015

Nevertheless, writing (essentially) append-only files is something that happens often enough that most filesystems should cope with reasonably well. And writing 20x slower than it could is nowhere near coping reasonably well.

(Also: note that you don't need to cope with nfs/etc. Flash is different enough from spinning rust that I have no problem with a filesystem being optimized for one or the other as opposed to both.)

tjoff · on Sept 5, 2015

I'm going to guess that the Pi uses ext3/4 which I'm assuming does pretty well with appending files on a spinning disk.

Yes, we need something tailored for flash and the advent of SSDs have put some effort into that. However, the controller of a SD-card is for understandable reasons not as advanced as one on an SSD. One of the issues that seemed prevalent of the Pi was that SD-cards, when issuing a TRIM command, just erases the flash - immediately. Which of course kind of defeats the purpose and leads to bad performance (trying to optimize for future writes by sacrificing current writes).

The recommendation that I saw was to schedule a pass clearing free space for when the Pi wasn't in active use.

Problem with the controller doing some of the stuff (which I guess is inevitable due to different design needs) and the file system doing some other stuff is that when the filesystem is paired with a bad controller (but one that still does some kind of wear leveling) you are kind of screwed anyway, because either implementation can not rely on the other. So, just use an SSD instead?

I guess it's worse on the Pi since it's somewhat trying to be a workstation on mobile hardware. And sometimes I get the impression that even phone makers can't tailor it well anyway (samsungs android filesystem for the older phones were apparently pretty awful for instance). And android only got TRIM support in, I believe, Kitkat? But only if the whole chain of device hardware and drivers supported it (so god knows which current devices actually support it)... Learning that was quite shocking to me and probably a large source of "my phone just keeps getting slower" problems. But who cares if it just accelerates sales of new phones?

TheLoneWolfling · on Sept 5, 2015

Or you have one filesystem that is geared towards dumb flash, and another that is geared towards smart flash.

I don't see why people try to design filesystems that do everything. As usual for hybrids, they try to do everything and as a result don't work well anywhere.

As long as you keep the limitations of dumb flash in mind (namely, that flipping bits in one direction is slow (things are a little more complex on MLC, but still doable)), it's surprisingly easy to design a filesystem that does well on "dumb" flash.

tjoff · on Sept 5, 2015

This is a tangent but you seem to be under the impression that filesystems grow on trees. I've been waiting for a modern COW linux filesystem for almost a decade now. Maybe in a few years it's mature enough to wait a few more years for it to become stable? (I'm looking at you BTRFS) ;)

I guess BTRFS would fall into the filesystems that do everything category... Though at the other side of the spectrum you have filesystems that have not been tested enough or isn't maintained.

simoncion · on Sept 5, 2015

FSs designed to run on dumb flash exist: https://encrypted.google.com/#q=site:lwn.net+nand+flash+file...

> I don't see why people try to design filesystems that do everything.

For general use, [0] I would rather use a FS that works on pretty much every media type [1] -and has reasonable perf- than to have to worry about whether $SPECIALTY_FS is actually tuned for $WEIRDO_BLOCK_DEVICE, or if I've failed to understand the particulars of either the FS, device, or both.

[0] That is, when getting the absolute best perf isn't a requirement.

[1] Except for like, tape, and optical RAM/WORM drives, natch.

TheLoneWolfling · on Sept 5, 2015

I don't mind having a filesystem that works reasonably well on most devices. But I would consider "being able to append to a file reasonably quickly" as part of that "reasonably well".

Note that it's entirely possible to have a filesystem that's actually several different filesystems with a magic selector. (I mean, block sizes and reserved block percentages and inline inodes already are along those lines)

simoncion · on Sept 5, 2015

If your underlying device is slow -as SD cards are notorious for being for any use case other than picture or video storage and retrieval-, no amount of FS juju can help you.

on Sept 5, 2015

[deleted]

RobotCaleb · on Sept 5, 2015

Is that necessary? Do try less hard, please.