Lua is a good place to start! And the VM minus parser is useful, though it may make debugging harder. You simply write out the bytecode and upload that.
The VM is only part of the problem, though. I consider the bigger problem to be memory management. A garbage collected language operating reliably in 128KB? I'm deeply skeptical.
While I don't think Lua would be the choice, it's the quantity of garbage that matters:
"Historically, languages intended for beginners, such as BASIC and Logo, have often used garbage collection for heap-allocated variable-length data types, such as strings and lists, so as not to burden programmers with manual memory management. On early microcomputers, with their limited memory and slow processors, BASIC garbage collection could often cause apparently random, inexplicable pauses in the midst of program operation. Some BASIC interpreters such as Applesoft BASIC on the Apple II family, had terribly inefficient garbage collectors for strings which repeatedly scanned the string descriptors for the string having the highest address in order to compact it toward high memory. This one-string-at-a-time processing loop resulted in O(N * N) time performance in the number of strings, which would introduce a pause more than one minute long into the execution of string-intensive programs. A replacement garbage collector for Applesoft BASIC published in Call-A.P.P.L.E. (January 1981, pages 40–45, Randy Wiggington) identified a group of strings in every pass over the heap, reducing a pause of two minutes into less than a second depending on the size of the group. Other approaches were published, but none ever made it into a new revision of the BASIC interpreter."
Garbage collection isn't necessarily that difficult, especially when you're in a single-threaded (or cooperatively multithreaded) environment and don't have to contend with an OS. Here[1] is a cons-pair system I wrote in Forth which uses tag bits to identify pointers, a mark-and-sweep garbage collector and a free-list style allocator. GC takes linear time with respect to heap size, and allocations are constant time. The allocator itself (not counting all the list utility routines) is about 60 lines long and fairly easy to understand.
The VM is only part of the problem, though. I consider the bigger problem to be memory management. A garbage collected language operating reliably in 128KB? I'm deeply skeptical.