It is well known that the actual speech processing happens in the cloud. Deploying a whole cloud voice recognition system when you already have a distributed network of devices with on-device speech-to-text capability would be a lot of redundant work.

That said, unless they do certificate pinning on the device, the answer is to MITM the device and snoop on the traffic.

If they do use certificate pinning, the answer is:

1. Pre-record an Alexa command

2. Play back the recording

3. Wait a minute

4. Replay the command

5. Measure the size of the packets going across the network

6. Wait a week while playing something that sounds like natural conversation - say, an audiobook

7. Replay the command audio file

8. Measure the amount of data sent between the end of the second command and the end of the last one

The total should be slightly more than what the second command sent, to account for things like checking for updates. But if it includes transcribed speech (which at this point is essentially a transcript of the audiobook), then it would be quite a bit larger, even with text compression.
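To make the comparison concrete, here is a rough Python sketch of the arithmetic. It assumes you have already exported the capture as a list of (timestamp, size) records; the `Packet` type, the function names, and the defaults (150 words per minute of narration, about 6 bytes per word, text compressing to about 30% of its size) are all illustrative assumptions, not measured values.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Packet:
    """One captured packet (hypothetical record type for this sketch)."""
    timestamp: float  # seconds since the capture started
    size: int         # bytes on the wire


def bytes_between(packets: Iterable[Packet], start: float, end: float) -> int:
    """Sum the sizes of packets with start < timestamp <= end."""
    return sum(p.size for p in packets if start < p.timestamp <= end)


def estimated_transcript_bytes(
    hours: int,
    words_per_minute: int = 150,   # assumed narration speed
    bytes_per_word: int = 6,       # assumed: ~5 letters plus a space
    compressed_percent: int = 30,  # assumed size after text compression
) -> int:
    """Back-of-envelope size of a compressed transcript of `hours` of speech."""
    words = hours * 60 * words_per_minute
    return words * bytes_per_word * compressed_percent // 100


def looks_like_transcript_upload(
    window_bytes: int, command_bytes: int, expected_transcript: int
) -> bool:
    """Crude heuristic: is the traffic beyond one command at least transcript-sized?

    Background traffic such as keep-alives and update checks is not modelled,
    so treat a True result as a reason to look closer, not as proof.
    """
    excess = window_bytes - command_bytes
    return excess >= expected_transcript
```

You would pass the bytes seen during the second command as `command_bytes`, the step 8 measurement as `window_bytes`, and `estimated_transcript_bytes(7 * 24)` as the expected transcript size. Under these assumptions a week of continuous speech comes to a few megabytes of compressed text, which is far more than a single short command should generate.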