I hear this all the time and it boils down to what another poster said. I can monitor all the traffic going through my network. If the device was constantly streaming audio it would be very obvious.
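For a sense of scale on "very obvious", here is a rough back-of-the-envelope sketch of how much data constant audio streaming would generate. The sample rate, bit depth, and codec bitrate are my own assumptions for illustration, not anything measured from a real device.

```python
# Back-of-the-envelope: data volume of continuous audio streaming.
# All parameters below are assumptions for illustration, not measurements.

def bytes_per_day(bitrate_bps: int) -> float:
    """Bytes sent in 24 hours at a constant bitrate given in bits per second."""
    return bitrate_bps / 8 * 60 * 60 * 24

# Uncompressed 16 kHz, 16-bit mono PCM (assumed capture format)
pcm_bps = 16_000 * 16
# A hypothetical low-bitrate speech codec at 16 kbit/s
codec_bps = 16_000

for label, bps in [("raw PCM", pcm_bps), ("16 kbit/s codec", codec_bps)]:
    print(f"{label}: {bytes_per_day(bps) / 1e6:.0f} MB/day")
```

Even under the compressed assumption that is well over a hundred megabytes a day from a device that is supposed to be idle, which is hard to miss on a traffic graph.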
To which someone once replied to me: "But what if they aren't sending traffic through your network? What if they are using 4G or something like that?"
To which I replied that, while I have not personally done it, you could scan using software-defined radio to detect that sort of thing. And if it was doing that, someone would notice. Plus, if anyone tore down the hardware, they would notice the antenna.
To which they replied "what if it uses something you can't detect with that?"
To which I walked away because I didn't feel like explaining how physics works.
2. Use a small neural network on the device to detect when a voice is present
3. Collect this data into one zip file, then send the file when the user says "Alexa", or anything remotely close.
They could even put a size limit on the data upload to reduce the variance to prevent you from ever testing whether they do this.
Or, they could simply transcribe the audio on the device and upload the text. Any audio they are unsure of could be uploaded to the server to be handled by a beefier neural network.
Yes, exactly this. They could easily just TTS what you are saying, save the text, and send it together with the rest of the info when you say "Alexa". Thus only sending information when you say "Alexa" but managing to upload all your conversations.
I would be very surprised if they aren't doing something like this. The power of analysing which products you talk about in your home more often, what kind of stuff you consume, what affairs you discuss at home, etc., is too good to pass up. And seriously frightening.
It is very well known that the actual speech processing happens in the cloud. Deploying a whole cloud voice recognition system when you already have a distributed network with TTS capability on the device would be quite a lot of redundant work to go through.
However, with that said, unless they do certificate pinning on their device the answer to that is to MITM the device and snoop on the traffic.
If they do certificate pinning the answer is:
1. Pre-record an Alexa command
2. Play back the recording
3. Wait a minute
4. Replay the command
5. Measure the size of the packets going across the network
6. Wait a week while playing something that sounds like natural conversation - say an audio book
7. Replay the command audio file
8. Measure the size of the data sent between the end of the second command and the end of the last
It should be slightly more than the second command was, to account for things like checking for updates. But if it includes the TTS (which is essentially an audio book transcribed at this point) then it would be quite a bit larger, even with text compression.
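The comparison in the steps above can be sketched roughly like this. It assumes you have already exported packet timestamps and sizes from whatever capture tool you use; the `Packet` record, the window boundaries, and the sample numbers are all made up for illustration. Packet sizes remain visible even when the payload is encrypted, which is why this works without MITM.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    timestamp: float  # seconds since the start of the capture
    size: int         # bytes on the wire

def bytes_in_window(packets, start, end):
    """Total bytes for packets with start <= timestamp < end."""
    return sum(p.size for p in packets if start <= p.timestamp < end)

def excess_ratio(packets, baseline_window, test_window):
    """How many times larger the test window is than the baseline window."""
    baseline = bytes_in_window(packets, *baseline_window)
    test = bytes_in_window(packets, *test_window)
    if baseline == 0:
        raise ValueError("no traffic in baseline window")
    return test / baseline

# Synthetic capture (invented numbers): first command at t=0, second
# command about a minute later, final command about a week later.
capture = [
    Packet(0.0, 1200),
    Packet(61.0, 1500), Packet(62.0, 900),
    Packet(604861.0, 1500), Packet(604862.0, 1100),
]

baseline_window = (60.0, 70.0)    # traffic for the second command
test_window = (70.0, 604870.0)    # everything up to the end of the last command

ratio = excess_ratio(capture, baseline_window, test_window)
print(f"test window is {ratio:.2f}x the baseline")
```

A ratio close to 1 would fit "only the command was sent"; a much larger one would need explaining, though not necessarily by transcripts.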
Any amount of text - when compressed - would be dwarfed by a number of things that may also be included in the data exchange, such as a software update. There's no way to conclude that a larger exchange of data means a big exchange of a week's worth of text.
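To put a rough number on "a week's worth of text", here is a quick estimate. The speaking rate and bytes-per-word figures are assumptions I picked for illustration, and the compression ratio should be measured on a real transcript sample rather than trusted from this sketch.

```python
import zlib

WORDS_PER_MINUTE = 150  # assumed speaking rate
BYTES_PER_WORD = 6      # assumed average word length, including the space

def raw_text_bytes(hours: float) -> int:
    """Estimated size of an uncompressed transcript of `hours` of speech."""
    return int(hours * 60 * WORDS_PER_MINUTE * BYTES_PER_WORD)

def compression_ratio(sample: bytes) -> float:
    """Compressed size divided by original size for a text sample, using zlib."""
    return len(zlib.compress(sample, 9)) / len(sample)

# A full week of non-stop speech, e.g. the audio book from the test above
week_raw = raw_text_bytes(24 * 7)
print(f"raw transcript: {week_raw / 1e6:.1f} MB")

# To estimate the compressed size, feed in a representative transcript:
#   week_compressed = week_raw * compression_ratio(my_sample_bytes)
```

Under these assumptions the raw transcript comes to about 9 MB before compression, so whether it stands out depends entirely on what else is in the exchange, which is the point being argued here.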
Nit: TTS is Text To Speech, so you need the other way around :) I most often saw that abbreviated with SDS, although that's also not too correct since a speech dialog system also covers more than only speech recognition.