So... 300 million connected devices, 2 bytes for ___location, some compression, 2 samples a minute -> about 1 GB per minute, or roughly 1.5 TB per day. That's easy storage for the NSA; I wonder how many years of data they have on everyone? Don't think for a minute they don't have an exception to the permission requirements.
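For what it's worth, the envelope math roughly checks out if "some compression" absorbs the rounding; a quick sketch under the parent's stated assumptions:

```python
# Checking the parent's numbers (assumptions: 300M devices,
# 2 bytes per ___location fix, 2 fixes per minute, no compression).
devices = 300_000_000
bytes_per_fix = 2
fixes_per_min = 2

per_min = devices * bytes_per_fix * fixes_per_min
per_day = per_min * 60 * 24
print(f"{per_min / 1e9:.1f} GB/min")   # 1.2 GB/min uncompressed
print(f"{per_day / 1e12:.1f} TB/day")  # ~1.7 TB/day uncompressed
```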
Would the NSA need a third-party website's API to amass phone geolocation data, considering that they might as well exploit the carrier-provided APIs?
Two bytes for a ___location? If you're going to store latitude and longitude, you'll want to do it with signed fixed-precision numbers. Four decimal places (roughly 11 m of resolution) if you don't care about accuracy too much, plus a range of up to 180 above and below zero. That means you'll need 22 bits per coordinate (21 bits to store the magnitude, since 180.0000 scales to 1,800,000 < 2^21, plus one bit for sign). You'd need 44 bits to store a full set of coordinates, and since we're not barbarians we can round that up to a nice even 64 bits (I don't think there's anything relevant you can store in the spare 20 bits... maybe a carrier ID?).
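Concretely, that packing could look something like this (a minimal sketch; the sign-magnitude encoding, field layout, and function names are just illustrative choices, not anything from the article):

```python
# Packing one coordinate pair into 64 bits, assuming 4 decimal
# places (~11 m) and sign-magnitude encoding per coordinate.
# Layout (made up for illustration): [20 spare bits][22-bit lat][22-bit lon]

SCALE = 10_000  # 4 decimal places

def encode(deg: float) -> int:
    """Encode one coordinate as 22 bits: 1 sign bit + 21 magnitude bits."""
    mag = round(abs(deg) * SCALE)
    assert mag < (1 << 21), "out of range"  # 180.0000 -> 1,800,000 < 2^21
    return (1 << 21) | mag if deg < 0 else mag

def decode(bits: int) -> float:
    sign = -1.0 if bits >> 21 & 1 else 1.0
    return sign * (bits & ((1 << 21) - 1)) / SCALE

def pack_fix(lat: float, lon: float, spare: int = 0) -> int:
    """Pack (lat, lon) plus 20 spare bits (carrier ID?) into one u64."""
    return (spare & 0xFFFFF) << 44 | encode(lat) << 22 | encode(lon)

def unpack_fix(p: int) -> tuple[float, float, int]:
    return decode(p >> 22 & 0x3FFFFF), decode(p & 0x3FFFFF), p >> 44

print(unpack_fix(pack_fix(40.7128, -74.0060, spare=310)))
# -> (40.7128, -74.006, 310)
```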
Then you need to identify which device it's for; that's another 32 bits if you're being conservative. I'll ignore the timestamp, since you could infer it from the record's position in the data store (assuming you have nice even 30 s intervals), but it would otherwise be a bonus 64 bits.
That's about 10 TB per day, but also pretty useless, because all you've got is a giant pile of packed integers. If you were going to do this for real, you'd probably store the user's ___location with full precision for the first sample, then simply store diffs, with a precision that assumes, say, the device isn't going to move faster than the speed of sound. The index is likely going to be pretty massive, and you'll constantly be reindexing because you're getting new data all the time (with a cardinality of 300M).
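A rough sketch of that delta scheme (the speed bound and the field-size arithmetic in the comments are my assumptions, not anything from the article):

```python
# Delta-encoding a track, assuming 30 s sampling and a speed bound of
# ~343 m/s (speed of sound). 343 m/s * 30 s ~= 10.3 km ~= 0.092 degrees
# of latitude ~= 925 scaled units, so each per-axis delta fits in a
# signed 11-bit field instead of a full 22-bit coordinate.
SCALE = 10_000  # 4 decimal places, ~11 m per unit

def delta_encode(track: list[tuple[float, float]]):
    """track: chronological (lat, lon) fixes -> (first fix, list of diffs)."""
    scaled = [(round(lat * SCALE), round(lon * SCALE)) for lat, lon in track]
    diffs = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(scaled, scaled[1:])]
    return scaled[0], diffs

def delta_decode(first, diffs):
    fixes, (lat, lon) = [first], first
    for dlat, dlon in diffs:
        lat, lon = lat + dlat, lon + dlon
        fixes.append((lat, lon))
    return [(lat / SCALE, lon / SCALE) for lat, lon in fixes]

track = [(40.7128, -74.0060), (40.7130, -74.0055), (40.7131, -74.0055)]
first, diffs = delta_encode(track)
print(diffs)                       # [(2, 5), (1, 0)] -- tiny numbers
print(delta_decode(first, diffs))  # round-trips the original track
```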
Even then, you're probably looking at something triple or quadruple that size once indexes and overhead are included. Let's say 40 TB per day. That's only 14.6 PB per year, which is not unreasonable storage for the government. But then you need to consider other things:
- You're ingesting 600M data points per minute. Even at the bare minimum (96 bits per sample), that's 7.2 GB per minute, or about 120 MB/s (envelope math sketched after this list).
- But of course, it's probably not in a binary format (in the article, it was JSON and XML). Make that roughly 50 GB per minute.
- Now you're parsing XML and JSON. That's not free when you're parsing 600M documents per minute. You'd likely need a pretty damn big server farm to ingest all this data.
- Your carriers need to be able to send that data to you, so they need to be equipped to (collectively) deliver ~50 GB per minute of encoded ___location data; not to mention be able to dump the latest record for every customer in their ___location database twice per minute.
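Putting rough numbers on the ingest side (all assumptions carried over from above; the 7x text-encoding overhead is a guess):

```python
# Envelope math for the ingest figures above.
DEVICES = 300_000_000
SAMPLES_PER_MIN = 2
RECORD_BYTES = 12        # 64-bit packed coords + 32-bit device ID
TEXT_INFLATION = 7       # assumed JSON/XML blow-up over raw binary

points_per_min = DEVICES * SAMPLES_PER_MIN
binary_per_min = points_per_min * RECORD_BYTES
encoded_per_min = binary_per_min * TEXT_INFLATION

print(f"{points_per_min / 1e6:.0f}M points/min")            # 600M
print(f"binary:  {binary_per_min / 1e9:.1f} GB/min "
      f"({binary_per_min / 60 / 1e6:.0f} MB/s)")            # 7.2 GB/min, 120 MB/s
print(f"encoded: {encoded_per_min / 1e9:.0f} GB/min "
      f"({encoded_per_min / 60 / 1e6:.0f} MB/s)")           # ~50 GB/min, ~840 MB/s
print(f"raw: {binary_per_min * 1440 / 1e12:.1f} TB/day")    # ~10.4 TB/day
```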
I wouldn't bet money that the government cares enough about every last device in the country to rig together such a big system only to keep track of where my grandmother's Jitterbug is. And there's no reason they couldn't just send an API request to AT&T or Verizon and say, "Hey guys, where has John Doe's phone been in the last week?" That's almost certainly more practical and cost-effective in every way.
You're right, it would take a few more bytes, but my point was more that the amount of data is entirely within the scope of the NSA et al. In fact, it was somewhat arbitrary to say two points a minute; in reality, one sample every two minutes is enough to get a very good idea of where somebody is or has gone.
Why would they want this data? Because they want to be able not just to track "persons of interest" in real time, but also to recreate past movements in order to establish associations.
As to the difficulty (if any) for the major telecoms of providing such a feed: that's chump change for the NSA to cover, just as regular law enforcement pays for wiretaps.