A lot of folks on here are saying that this is cool but useless because there are better ways to click a button on a screen. If you read through their paper (http://groups.csail.mit.edu/uid/projects/sikuli/sikuli-uist2...) you'll find more practical examples of what can be done with this type of system.
One such example tracks real-time images from a webcam pointed at a baby, using Sikuli to watch for a yellow dot placed on the baby's forehead. Another tracks the movement of something across the screen; in this case, a bus moving along Google Maps.
I agree that there are better ways to do most of the things in their examples, and that they should probably rework their videos a bit, but just because this system doesn't solve your problems the way you want it to doesn't mean it's useless.
I don't think anyone is saying the technology isn't cool (and I see no one using the word "useless" but you).
I believe most people are warning about the demonstrated technology for this task:
> Sikuli is a visual technology to search and automate graphical user interfaces (GUI) using images (screenshots).
As I said below, for personal scripting of known applications I think this GUI automation is great. But upon seeing it, we had all hoped for a technology that would get us over our biggest hurdles in GUI development/testing. As it doesn't sound like it would do that (well) for a number of reasons (e.g. localization, themes, OS/app versioning, design changes during development, coloring), it loses a lot of its practical appeal.
It's still really cool technology and is a cool way to think about this kind of problem.
I'm kind of surprised by all the naysaying. Yeah, I get the one big pitfall of this, but it seems like one of the better solutions to a class of inherently hairy problems.
It sounds... scary. Like it will work well enough at first, and then explode when someone changes their desktop theme (especially icon theme), or wants to upgrade to a new version of whatever. Treating things as change-controlled APIs when they aren't just seems dangerous. Still I guess there's at least some amount of change control coming from platform conventions and human interface guidelines, and this comes closer to operating at the correct level of abstraction to benefit from that.
One problem with GUIs from the start has been that automating their behavior is fragile.
Automating actual keystrokes and mouse movements is the most fragile and kludgy of automations. Even drilling to the level of system messages is problematic.
Aside from its other good qualities, the web is really nice for reducing possible user interactions to a very codified set of operations.
The article here points to ... the wrong way of doing things
I wondered about that, too---how fragile will this be in the face of minor GUI changes? (A fuzzy-match parameter would be a nice addition to the API.) One encouraging sign is that the screenshots are stored as .pngs in the file system. If an app update does nothing but change button graphics or text, the script could presumably pull new graphics from the same place.
That just leaves the other 90% of GUI changes that involve moving a setting to another screen, or changing the way an entire interaction works. ;-)
All the stuff you said about GUI automation being klunky is true. Still, this thing could come in handy as a kind of "GUI batch file," or perhaps as a tool to help produce screencasts.
They do say that it does fuzzy matching, so it can still work with minor changes.
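A fuzzy match of this kind is essentially template matching with a similarity threshold. Here's a toy pure-Python sketch of the idea on a grayscale "screen" represented as a 2D list of 0-255 ints (the helper names are hypothetical; Sikuli's real matcher is far more sophisticated):

```python
def similarity(patch, template):
    """1.0 for identical patches, falling toward 0.0 as pixels diverge."""
    total = sum(abs(p - t) for prow, trow in zip(patch, template)
                           for p, t in zip(prow, trow))
    count = len(template) * len(template[0])
    return 1.0 - total / (255.0 * count)

def find_fuzzy(screen, template, threshold=0.7):
    """Return (row, col) of the best match at or above threshold, else None."""
    th, tw = len(template), len(template[0])
    best, best_pos = threshold, None
    for r in range(len(screen) - th + 1):
        for c in range(len(screen[0]) - tw + 1):
            patch = [row[c:c + tw] for row in screen[r:r + th]]
            score = similarity(patch, template)
            if score >= best:
                best, best_pos = score, (r, c)
    return best_pos
```

Lowering the threshold is exactly the knob that trades robustness to minor GUI changes against false positives.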
I don't think any program guarantees that the GUI is a stable API! I do wonder about its ability to handle unexpected conditions, or for the scripter to know that these conditions exist in the first place. For example, if the GUI changes based on the day of the week, the script will probably either get very confused or do the wrong thing. I imagine there's some way to do conditionals (if you see this, then do thing 1, otherwise do thing 2), but it's hard for the scripter to discover all the possibilities. Still, might be useful for some things.
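That "if you see this, do thing 1, otherwise thing 2" pattern is easy enough to express once you have an image search primitive. A sketch, where `screen_has` stands in for the image search and all names are hypothetical (this is not Sikuli's actual API):

```python
def dispatch(screen_has, branches, fallback):
    """Run the handler for the first image found on screen.

    branches: list of (image_name, handler) pairs, checked in order;
    fallback runs if nothing matches (the "very confused" case).
    """
    for image, handler in branches:
        if screen_has(image):
            return handler()
    return fallback()
```

The hard part the comment identifies remains: enumerating the branches requires knowing every screen the app can show, which the scripter usually doesn't.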
Also, GUI applications are generally stateful, so while you might generally get the same behavior from, say, navigating to an options dialog, you're not guaranteed the same behavior, especially since the creator hasn't made any effort to guarantee the same behavior from the same actions. (The web can have the problem of statefulness too, but at least on the web you're supposed to avoid it.)
Sometimes I see HN fail to collaboratively discover this kind of interesting topic. I posted the paper well before the media coverage (more than 100 days ago): http://news.ycombinator.com/item?id=810986 and it didn't get a single upvote.
There are a lot of interesting papers out there but, from what I've seen, HN voters tend to appreciate actual products and services a lot more than theory (and certainly don't appreciate PDFs). There's undoubtedly room for a more compsci/theoretical HN equivalent - unless such a thing already exists (Reddit/compsci?)
Nonetheless, that's a cool paper and it's a shame it didn't make the front page first time round :-)
Darn, that looks awesome. I suspect it died from the usual "launching a small online community" problems of hitting critical mass, etc. Maybe this - or something like it - could take off given the right amount of care by a group of dedicated folks..
The people dissing this have obviously never dreamed of automating a 16-bit Visual Basic 3 Windows app (that's Win16, not Win32) so it can be run from a webapp front-end and gradually phased out.
Autohotkey works, but matching by screenshots with computer vision would cut the amount of work required in half.
I can imagine this being useful for knowing to stop when things start going wrong. One problem I've had with GUI-automators in the past is that they've just kept automating after something unexpected happened and put them into an invalid state. It seems like Sikuli could avoid this by literally knowing when the screen looks wrong.
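That "stop when the screen looks wrong" idea can be made explicit: assert an expected anchor image before each step instead of blindly continuing. A sketch with hypothetical names (`screen_has` again stands in for the image search):

```python
class ScreenMismatch(Exception):
    """Raised when the screen doesn't show what the script expects."""

def run_guarded(steps, screen_has):
    """steps: list of (expected_image, action) pairs.

    Verify the expected image is visible before each action; abort on
    the first mismatch rather than automating into an invalid state.
    """
    results = []
    for expected, action in steps:
        if not screen_has(expected):
            raise ScreenMismatch("expected %s on screen" % expected)
        results.append(action())
    return results
```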
(Sometimes you should take a step back and ask yourself, "is looking for pictures on the screen really the best way to do this?" The example they show on the main page is a one-line "ifconfig" invocation, for example.)
Their example is a bit juvenile. Even so, keep in mind that it is a one-line ifconfig invocation only for those who understand the terminal and know about ifconfig, which does not hold for the vast majority of users. This is about making automation more usable. I'm not sure whether it's a productive approach, because my perspective is adulterated by technical knowledge. But at the very least, it is a very interesting approach which could prove fruitful for your average user.
Sure, but there's also a precondition for using this: you have to understand that it's actually even possible to automate repetitive tasks on a computer, and have the motivation to do so. Hell, I know programmers who don't do that.
How many people fit that description, but will still find this easier than ifconfig? My guess is few.
I think its usefulness (or lack thereof) can't accurately be critiqued by anecdotes from technical users, but rather needs to be quantified through usability testing. It's easy to blow this idea off, but the "solve something your users didn't know needed to be solved" advice I occasionally see on HN keeps echoing in my head.
> it is a one-line ifconfig invocation for those that understand the terminal and know about ifconfig, which does not hold for the vast majority of users
I know people who are just as much afraid of the terminal as they are of the network settings. And stepping someone through opening the terminal and typing something is approximately the same number of actions (in clicks and keystrokes) as navigating to the settings panel and typing in the info.
And while I agree that the example of modifying the network settings is contrived, and thus not a good example of the usefulness of this setup, the difference between modifying network settings via the system settings GUI and from the terminal with ifconfig is that one will be remembered after a reboot and one will not, despite both having similar surface complexity. That is mainly because the actual settings are buried in some obscure, OSX-specific file somewhere that most likely isn't even editable without the GUI (same with Windows (buried in the registry) and with Gnome's Network Manager (buried in ~/.gconf/system/networking/connections)).
Yeah, so this is the same as gnome's gconf settings for NetworkManager preferences.
But that being said, the plist file format isn't all that great (there was a time when Apple had binary "compiled plist" files, but the pl command no longer supports that) -- the alternating "key" and value tags, associated only by their order in the file, are kind of anti-XML. On one of my Xserves with two interfaces active, this file is 9k - massive for its purpose. This is not something that even an experienced administrator would want to edit with anything other than the GUI; doing so would be quite error-prone. So if you want your settings to stick, you should be doing it from the GUI on OSX. Doing it from the GUI in a way that sticks and from the command line in a way that doesn't stick are roughly equal in complexity, but the results differ: getting it to stick by editing from the command line is quite a bit more complex.
I’m not quite average but certainly closer to average than most here on HN, so let me just say that I found this immediately intuitive. I wanted to try it.
It’s a Rube Goldberg machine, but playing with those is always fun and intuitive.
(By the way, that was all very explicit and non-automatic. Would it be possible to do this in some way where you just hit record and then click your way through? You would just have to snap screenshots whenever the user clicks. It might be a problem with things that change their appearance on mouse-over, like drop-down menus, but I guess that's solvable.)
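The core of such a record mode is simple: on each click, snapshot a small patch around the cursor to use later as the match target. A pure-Python stand-in on a 2D "screen" array (a real recorder would hook OS mouse events and grab actual screenshots; all names here are hypothetical):

```python
def crop_around(screen, x, y, radius=1):
    """Crop a (2*radius+1)-square patch centred on (x, y), clamped to the
    screen edges. x is the column, y is the row."""
    top = max(0, y - radius)
    left = max(0, x - radius)
    return [row[left:x + radius + 1] for row in screen[top:y + radius + 1]]

def record(screen, clicks, radius=1):
    """Turn a list of (x, y) clicks into (position, patch) script steps."""
    return [((x, y), crop_around(screen, x, y, radius)) for x, y in clicks]
```

The mouse-over problem the comment raises would show up here as the recorded patch containing the hover-highlighted widget rather than its resting appearance.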
Juvenile? That's a real problem people face every day because some designer/programmer failed to foresee it. This automation tool is a way to deal with the problem in a way that users can understand. Sure, ifconfig is a superior way to do it, but it's beyond the reach of most users. This kind of GUI automation tool is something that empowers normal people to work around their problems.
I don't really "get" GUI automation. The only bugs that GUIs can have are "this is ugly", "this is spelled wrong", "this is confusing", and so on. You can't automate that testing away; a human will have to click through and tell you what he thinks.
The actual program logic is tested by the usual integration/unit tests, not by clicking buttons in the GUI. (And if you are worried about "what if clicking button foo doesn't run function foo", then you need to write more tests for the GUI generator library, not for your application.)
There are lots of counterexamples where GUI bugs represent real functionality problems, such as "option X is disabled under condition YWZ, though it shouldn't be". However, GUI tests should be a small subset of all tests; otherwise you run into very difficult problems, as most GUI test frameworks are very fragile and require extensive re-baselining after even relatively minor changes.
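Notably, much of the "disabled when it shouldn't be" class can still be caught below the pixel level if the enable/disable rule is factored out of the widget. A toy sketch with a hypothetical rule:

```python
def export_button_enabled(has_selection, is_saving):
    """Intended rule: enabled iff something is selected and no save is running."""
    return has_selection and not is_saving
```

A pixel-level GUI test is then only needed for the thin remaining question of whether the widget actually reflects this state, which keeps the fragile tests few.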
That said, there is much more going on under the GUI than clicking buttons. Tons of messages pass to and through the window, triggering a lot of system events which cannot be tested without mimicking mouse/keyboard movements. Besides, some tests rely on the visual elements. See the Google Chrome blog on automated GUI testing: http://blog.chromium.org/2009/02/distributed-reliability-tes...
Two wonderful things about this:
1. as frankenstein-ed together as the tech is, it works*
2. this is arguably more natural than 'workflow' recording functionality like Automator, and I found the actual 'code' highly readable (although inscrutably hard to debug, test, or _run_ without the IDE...)
All in all I love the way the idea works right now, although Java feels less than elegant on the Mac.
*(er, although for me it's got a killer bug - using the hotkey to take a screenshot does not work and gives no option of cancelling... a hardcore crasher in my book)
This tool looks really interesting, and I love the idea that it can be programmed using Python. I've used a number of GUI automation tools in the past, like AutoHotkey (which I can also highly recommend). This one looks like it would make it easier to do certain tasks that are difficult in AutoHotkey, for example interacting with webpages or other applications that don't have standard interfaces that can be examined with system APIs.
The screenshot approach this tool takes is unique. My only criticism is that, judging by the video, the image processing seems slow compared to an AutoHotkey script.
What I'm really waiting for is a tool that takes this one step further and does OCR on any on-screen text. This would make it easy to interact with GUIs that present text that can't be read using system APIs - imho that would be the holy grail of GUI automation.
If someone could take this concept a step further and let you create a self contained process that users could download and run just by clicking (like tasks in photoshop), I could see some uses:
- Some tech support situations where you have to have a user do x amount of steps on their computer that are the same for all users. Sort of like an automated Geek Squad.
- Sell a prepackaged GTD style organization system that creates all the folders for you in the right places, downloads files (pre-made budget spreadsheet for example) into them, etc. (trivial, but it's a pain point for people)
- Make a bunch of different productivity apps that mimic the steps a professional programmer/ photographer/ marketer etc does when they first setup a new computer (bookmarks, preference settings, etc.)
Clearly Sikuli has flaws, but for a research project, their presentation and execution is impressive. Their efforts should be commended. Hopefully they'll continue enhancing their scripting environment so that the scripts are robust to significant variation in the GUI.
Very cool, but it would have major limitations outside of just making a "personal script" or, at best, a script for a heavily locked-down enterprise/academic setup.
Because it uses literal images, it seems like any change in OS theme, OS version, app version, localization (e.g. text or control shape), or colors (e.g. high contrast mode) would break the scripts.
It'd be neat to use for GUI automation during software development except for the fact that the GUI changes, the button wordings are tweaked, etc.
In all of these cases, back-end or OS GUI automation is probably better, but if you have an unchanging environment or want a quick on-the-fly test, the screenshot approach is novel and probably a bit cooler.
Agreed the demo is silly, but these are problems that are hard to solve without GUI automation. For example, this tool could be great for scraping Flash-based websites, which are notoriously painful to automate. And the integration with Python means that you can easily mix and match with conditional statements, calls to OCR libraries, etc.
This is a much nicer and more intuitive alternative to http://autohotkey.com on Windows. I've tried introducing AutoHotkey at work to automate some of the mundane tasks, but its learning curve was too steep for most of my co-workers. I'm going to introduce this at my workplace.
If you skip to the last 30 seconds of the first SIX MINUTE video tutorial, you can see the app in action. Otherwise, you have to sit through a whole class on how to use the app before you even know if you want to use it.
Little lesson in creating a good video demo....
Get to the point.
Then provide more videos for details.
(I guess you could say this should be expected from an MIT project website)
It looks like a more advanced version of tools like Quick Test Pro.
There is big money in tools like that, but I can tell you, it's a real PITA to write test scripts using them. Given the option, you are better off exposing your app's object model to a scripting language and letting testers script it that way.
Obviously that doesn't work for third-party or legacy apps. So it definitely has a market. And their computer vision algorithms have to be better than the godawful bitmap comparison tools that QTP used.
The best use case I can think of for this is writing automated test cases for a browser-based app. Selenium does a pretty good job of that already.
The demo (automatically setting an IP) is a one-time job. How many times do we have to do that task? So there is no need for me to automate that kind of job. But having said that, this could still be useful in some use cases. One example I can think of is testing desktop apps.
This is incredibly useful. That's why Redstone Software has been selling it for years, under the name Eggplant - see http://www.testplant.com/products/eggplant_functional_tester .
It takes a lot of work in QA to figure out why this is useful (back me up on this one, experienced QA engineers) and the right way to do it, so I'll give you the Cliff Notes:
This sort of bitmap recognition lets you automate that "last mile" QA groups can never seem to automate. AutoHotkey, Selenium, and other tools all help automate lots of aspects of the interface, with tons of caveats and gotchas. This is a much more useful, if less pleasingly elegant, solution.
When you are automating testing, it's relatively easy to automate back-end stuff, write unit tests, and write scripts wrapping CLI interfaces, but every automation team that deals with GUIs eventually stubs its toe on automating the user interface. By having the computer automate the GUI task the same way a human user executes it ("I want to click the Apple Menu - Where is the Apple icon I know is on top of the Apple Menu? - Ah! There it is! I'll click it"), you make it easier, or even possible, for the people writing the QA automation to automate the GUI in a reasonable amount of time.
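That "Where is it? Ah, there it is!" loop is usually written as a poll-with-timeout rather than a one-shot lookup, so the script tolerates slow redraws. A sketch with hypothetical names; the clock and sleep are injectable so the sketch is testable:

```python
import time

def wait_for(find, image, timeout=10.0, interval=0.5,
             clock=time.monotonic, sleep=time.sleep):
    """Poll find(image) until it returns a location or the timeout expires.

    find returns a location (any non-None value) when the image is on
    screen; raising on timeout is what surfaces "Can't find the foo button".
    """
    deadline = clock() + timeout
    while True:
        location = find(image)
        if location is not None:
            return location
        if clock() >= deadline:
            raise TimeoutError("gave up waiting for %s" % image)
        sleep(interval)
```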
There are some pitfalls. What if someone changes the theme on the automation rig? Well, you're an engineering team, not a preschool - DON'T change the theme!
What if somebody changes an icon in the app you're testing? Fortunately you have access to the bitmap (it's saved with the rest of the build files, yeah?) and of course the change notes for the build tell you the icon has been updated. Well, of course it isn't in the change notes, but when a test that was working fails, you can easily run to the point where it says "Can't find the foo button." That's a hint to look for the foo button and think about why it can't be found.
Finally, all good scripting languages have an escape hatch to call other programs that can do things better than they can and return a result. Need to check an old COM object through its native interface? Write a small Windows app that your script calls to get that state.
It takes a lot of experience and frustration with trying to fully automate tests on a GUI to understand why this is useful. And the cry of "bitmaps break because things change" - well, no they don't. Not on a computer. Not if you know what you're doing and have control of the source. (Please disable all auto-update systems on your test rig or you will be surprised at some point.)
How often do you have to read and re-type an error message into Google because the text can't be copied and pasted? This technology could OCR the screen text and Google it for you automatically.
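The glue for that is mostly stdlib; only the OCR step needs anything exotic. A sketch where the URL building is real stdlib, and the commented-out OCR step assumes pytesseract and a PIL screenshot of the dialog (hypothetical variable names):

```python
import urllib.parse

def google_url(error_text):
    """Turn OCR'd error text into a Google search URL."""
    query = urllib.parse.quote_plus(error_text.strip())
    return "https://www.google.com/search?q=" + query

# Hypothetical glue, not run here:
#   import pytesseract, webbrowser
#   text = pytesseract.image_to_string(screenshot_of_dialog)
#   webbrowser.open(google_url(text))
```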
The demo video is proof of concept; make sure you read the paper.
I've had to use some non-scriptable, proprietary software that this might actually be useful for in doing repetitive tasks. This is especially true at some places where I have done some engineering consulting (non-software). It would probably fall in the category of ugly hack, but would also save some headache for me.
How well would this work for game-playing bots? If this can abstract away the detection and clicking of regions, it would make building one much more approachable.