I hear this all the time and it boils down to what another poster said. I can mo...

idanoeman · on March 7, 2017

How do you know they don't

1. Listen all the time

2. Use a small neural network on the device to detect when a voice is present

3. Collect this data into one zip file, then send the file when the user says "Alexa", or anything remotely close.

They could even put a size limit on the data upload to reduce the variance to prevent you from ever testing whether they do this.

Or, they could simply transcribe the audio on the device and upload the text. Any audio they are unsure of could be uploaded to the server to be handled by a beefier neural network.

andrepd · on March 7, 2017

Yes, exactly this. They could easily just TTS what you are saying, save the text, and send it together with the rest of the info when you say "Alexa". That way they only send information when you say "Alexa" but still manage to upload all your conversations.

I would be very surprised if they aren't doing something like this. The power of analysing which products you talk about most often in your home, what kind of stuff you consume, what affairs you discuss at home, etc., is too good to pass up. And seriously frightening.

throwaway2016a · on March 7, 2017

It is very well known that the actual speech processing happens in the cloud. Deploying a whole cloud voice recognition system when you already have a distributed network with TTS capability on the devices would be quite a lot of redundant work to go through.

However, with that said, unless they do certificate pinning on their device the answer to that is to MITM the device and snoop on the traffic.

If they do certificate pinning the answer is:

1. Pre-record an Alexa command

2. Play back the recording

3. Wait a minute

4. Replay the command

5. Measure the size of the packets going across the network

6. Wait a week while playing something that sounds like natural conversation - say an audio book

7. Replay the command audio file

8. Measure the amount of data sent between the end of the second command and the end of the last

It should be slightly more than the second command alone, to account for things like checking for updates. But if it includes the TTS output (which at this point is essentially a transcribed audio book), then it would be quite a bit larger, even with text compression.
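The comparison in the last two steps can be sketched roughly as follows. This is only an illustration and assumes you already have a packet capture reduced to (timestamp, size) pairs; the `Packet` type, the window boundaries, and the `threshold` of 2.0 are all made up for the example, not anything the device or a capture tool provides.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Packet:
    timestamp: float  # seconds since the capture started (hypothetical format)
    size: int         # bytes observed on the wire


def bytes_between(packets: Iterable[Packet], start: float, end: float) -> int:
    """Total bytes of all packets with start <= timestamp < end."""
    return sum(p.size for p in packets if start <= p.timestamp < end)


def compare_windows(
    packets: Iterable[Packet],
    baseline: Tuple[float, float],
    test: Tuple[float, float],
    threshold: float = 2.0,  # assumed cutoff, pick your own
) -> dict:
    """Compare traffic in the test window against the baseline window.

    baseline: the window covering the second command (step 5).
    test: from the end of the second command to the end of the last (step 8).
    """
    packets = list(packets)
    baseline_bytes = bytes_between(packets, *baseline)
    test_bytes = bytes_between(packets, *test)
    ratio = test_bytes / baseline_bytes if baseline_bytes else float("inf")
    return {
        "baseline_bytes": baseline_bytes,
        "test_bytes": test_bytes,
        "ratio": ratio,
        "suspicious": ratio > threshold,
    }


if __name__ == "__main__":
    week = 7 * 24 * 3600
    capture = [
        Packet(60.0, 50_000),          # second command
        Packet(300_000.0, 2_000),      # idle chatter during the week
        Packet(week + 60.0, 52_000),   # last command
    ]
    print(compare_windows(capture, (60.0, 65.0), (65.0, week + 70.0)))
```

A ratio near 1 suggests the last command cost about as much as the second one; a much larger ratio only says that something extra was sent, not what it was.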

ben174 · on March 8, 2017

Any amount of text, once compressed, would be dwarfed by a number of other things that may also be included in the data exchange, such as a software update. There's no way to conclude that a larger exchange of data means a week's worth of text was uploaded.
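A back-of-the-envelope version of that argument, as a sketch. Every number here is an assumption chosen for illustration (speaking rate, hours of speech per day, bytes per word, compression ratio, and the size of a software update); none of them are measured values.

```python
def weekly_transcript_bytes(
    words_per_minute: int = 150,        # assumed conversational speaking rate
    speech_hours_per_day: float = 2.0,  # assumed hours of audible speech per day
    bytes_per_word: int = 6,            # assumed average, including the space
    compression_ratio: float = 0.3,     # assumed compressed/original size for English text
    days: int = 7,
) -> int:
    """Rough size in bytes of a compressed transcript of `days` days of speech."""
    words = words_per_minute * 60 * speech_hours_per_day * days
    return round(words * bytes_per_word * compression_ratio)


def transcript_share(transcript_bytes: int, other_bytes: int) -> float:
    """Fraction of the total upload/download that the transcript would make up."""
    return transcript_bytes / (transcript_bytes + other_bytes)


if __name__ == "__main__":
    transcript = weekly_transcript_bytes()
    update = 50 * 1024 * 1024  # hypothetical 50 MiB software update
    print(f"transcript: ~{transcript / 1024:.0f} KiB")
    print(f"share of traffic if an update is present: {transcript_share(transcript, update):.2%}")
```

Under these assumptions the transcript comes to a few hundred kilobytes, which would be a fraction of a percent of the traffic if an update of that size landed in the same window.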

Matthias247 · on March 8, 2017

Nit: TTS is Text To Speech, so you mean the other way around (speech to text) :) I have most often seen that abbreviated as SDS, although that isn't quite correct either, since a speech dialog system covers more than just speech recognition.