Replibyte – Seed your database with real data (github.com/qovery)
222 points by evoxmusic on July 10, 2022 | 22 comments



Trying to think how to anonymise datetimes hurts my head. You might want to randomise the date of an event. But you also need this random date to be consistent with respect to both the current time and the order of other related rows in the database.
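One way to square part of that circle: shift every timestamp belonging to the same entity by a single random offset, so order and relative spacing survive but absolute dates don't. A rough TypeScript sketch (my own illustration, not something Replibyte provides; ANON_SECRET is a hypothetical key you'd supply out of band):

  import { createHmac } from "node:crypto";

  const SECRET = process.env.ANON_SECRET ?? "dev-only-secret"; // hypothetical key

  // Derive a deterministic offset (0..90 days, in ms) from the entity id, so the
  // same entity gets the same shift across tables and across runs.
  function offsetFor(entityId: string): number {
    const days = createHmac("sha256", SECRET).update(entityId).digest().readUInt32BE(0) % 90;
    return days * 24 * 60 * 60 * 1000;
  }

  function shiftTimestamp(entityId: string, ts: Date): Date {
    return new Date(ts.getTime() - offsetFor(entityId));
  }

Note the trade-off: relative order is preserved because everything moves together, but inter-event intervals survive too, which can itself be identifying.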


The answer is always “it depends,” but I think if a datetime is a UTC timestamp, such as a record of when an event happened, then with random sampling it shouldn’t matter? It’s just a timestamp. The information it contains might include ___location, might include timing relative to other events, could be correlated, but… on its own? It doesn’t need anonymization. Likewise, the sequence of events should be safe to use.

I get that you can look up or de-anonymize an event by its timestamp and the same is true of ID numbers. But it’s worse for ID numbers because these are often permanent and re-used for multiple events.

But yeah, the risk in anonymized data is that it’s never truly both anonymous and useful. Truly anonymous data might be considered junk or random data.

Anonymized data has some utility purpose to fulfil. Perhaps “realistic” analytics is required, or you want to troubleshoot a production issue without revealing who did what to engineers. So you anonymize the fields they shouldn’t see, and create a subset of data that reproduces the issue…?

Anonymized data is almost always a bad approach compared to generating data from algorithmic or random sources, but sometimes we need anonymized or restricted data to start that process.


Data can be anonymous and useful. You do, however, have to define what you mean by useful, and use that to inform how you go about making it anonymous.

A good example is: https://gretel.ai/blog/gretel-ai-illumina-using-ai-to-create...

Full disclosure, I work at Gretel, but I thought this was relevant enough to mention.


True? But I wouldn’t call creating new data from non-anonymous data “making data anonymous”. Instead, that’s new random data whose values are constrained or based on real-world data. I’d call that newly generated data, not anonymized data.

To me, anonymized data has an inherent risk of leaking the original transaction because it is a one-to-one mapping of the original data. If you generate new data, it will by definition diverge from the production dataset in some way that might be unrealistic. For example, fields with address components might not actually point to real places, or might not be written the same way as they would be in production. Perhaps a portion of production data includes international addresses or rural routes that your software might fail to generate, or worse, maybe it would generate them incorrectly.

Frankly, generating data is a better approach than anonymized data. And I know of anonymization techniques where good data is mixed with bad data and statistically, the bad data can be filtered out later but only in aggregate, etc. But I’m drawing a line in the sand between anonymized data that closely matches real data, and that which is “generated data”, because you can still potentially learn from the anonymized data but you can’t learn from generated data much more than you would from the initial model that created the constraints used to generate the data. I’m probably explaining this poorly, it’s a bit late at night in my time zone. :)


How does it keep personal data safe? I had a look at “how it works” and “faqs” but they don’t answer how you keep stuff safe? It also gets uploaded to S3?

I might have missed it, but I need to know exactly where our PII is stored (so not on a dev laptop), how do you know what to replace and what do you do with any info you do replace?

Edit: To answer my own question, via transformers. But that seems to suggest each dev has to keep it up to date with any schema changes, etc.

(Also some links are broken on GitHub)


The user tells it which fields need replacing via the YAML config.


Hi, author of Replibyte here :)

Yes, transformers are the way to go. I plan to add a way to detect schema changes and, at the least, avoid creating a dump when the schema has changed. I don't think it can be done safely without a human admin check.

(Thank you for your PR)


You may want to check out Snaplet at https://docs.snaplet.dev. I'm the co-founder, but we're not open-source (yet). Our goal is to give developers a database, and data, that they can code against.

We identify PII by introspecting your database, suggest fields to transform, and provide a JavaScript runtime for writing transformations.

Besides transforming data, you can also reduce and generate data. We are most excited about data generation!

The configuration lives in your repository, and you can capture the snapshots in GitHub Actions. So you get "gitops workflow" for data.

A typical git-ops workflow:

  1. Add a schema migration for a new column. 
  2. Add a JS function to generate new data for that column (a rough sketch follows the list).
  3. Add code to use the new column.
  4. Later, once you have data, use the same function to transform the original value. (Or just keep generating it.)
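For step 2, the function could look roughly like this (a sketch of the shape only, with a made-up `plan` column; this is not Snaplet's exact API):

  import { createHash } from "node:crypto";

  const PLANS = ["free", "starter", "pro"] as const;

  // Generates a value for the new `plan` column while no real data exists yet, and
  // later transforms the original value the same way (steps 2 and 4 above).
  export const planColumn = (row: { id: string; plan?: string }) => {
    // Seed from the original value when present, otherwise from the row id, so
    // reruns are deterministic and the original plan never leaves the snapshot.
    const seed = row.plan ?? row.id;
    const n = createHash("sha256").update(seed).digest().readUInt32BE(0);
    return { plan: PLANS[n % PLANS.length] };
  };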


One feature I’d love to see is a transformer that, instead of providing a random value, provides a cryptographic one-way hash of the data (e.g. SHA-2). That way key uniqueness is preserved (so unique constraints on columns still hold), and the same value used in one place will match the same value in another table after transformation, which more accurately reflects the “shape” of the data.
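A minimal sketch of such a transformer (my own illustration; it uses a keyed HMAC rather than bare SHA-2 so low-entropy fields like phone numbers can't simply be brute-forced, and ANON_KEY is a hypothetical secret):

  import { createHmac } from "node:crypto";

  const KEY = process.env.ANON_KEY ?? "dev-only-key"; // hypothetical secret

  // Deterministic pseudonymization: the same input always maps to the same output,
  // so unique constraints and cross-table joins keep working on the transformed data.
  export function pseudonymize(value: string, length = 16): string {
    return createHmac("sha256", KEY).update(value).digest("hex").slice(0, length);
  }

  // pseudonymize("alice@example.com") === pseudonymize("alice@example.com")  // joins still line up
  // pseudonymize("alice@example.com") !== pseudonymize("bob@example.com")    // uniqueness preserved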


We do this via Copycat (https://github.com/snaplet/copycat). We generate static "fake values" by hashing your original value to a number and mapping that number to a fake value.
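Usage is roughly like this (a sketch from my reading of the Copycat README; the exact method names are an assumption):

  import { copycat } from "@snaplet/copycat";

  // Same input -> same fake value, so transformed data stays consistent across
  // tables and across snapshot runs; different inputs get different fake values.
  copycat.email("alice@acme.org");   // always the same fake email for this input
  copycat.email("alice@acme.org");   // identical to the call above
  copycat.fullName("user-1234");     // deterministic fake name derived from an id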


This will not work, at least not if we’re talking about PII as it is defined by any Somewhat Sane (TM) privacy legislation.

Sure, passwords and credit card info are obscured with your methodology, but names, dates of birth, sexual orientation, telephone numbers, emails and IPs will remain unique. This uniqueness is what allows you to potentially identify a person given enough data.


>Sure, passwords and credit card info is obscured with your methodology

Even that's problematic, because there may be code that depends on the data being somewhat "real". Credit card numbers, for example, may need to pass Luhn checks, or have valid BINs, etc.
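For illustration, here's a small sketch of generating Luhn-valid fake numbers (my own example; "411111" is the well-known Visa test BIN, and whether that satisfies your BIN checks is an assumption):

  // Compute the Luhn check digit for a partial card number (all digits except the last).
  function luhnCheckDigit(partial: string): number {
    let sum = 0;
    const digits = partial.split("").reverse().map(Number);
    for (let i = 0; i < digits.length; i++) {
      let d = digits[i];
      if (i % 2 === 0) { // every second digit, counting from the check-digit side
        d *= 2;
        if (d > 9) d -= 9;
      }
      sum += d;
    }
    return (10 - (sum % 10)) % 10;
  }

  // Build a 16-digit Luhn-valid fake card: test BIN + random digits + check digit.
  function fakeCardNumber(bin = "411111"): string {
    let body = bin;
    while (body.length < 15) body += Math.floor(Math.random() * 10);
    return body + luhnCheckDigit(body);
  }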


I suppose that what you’d have to do is change the data and then hash it. But once you’ve changed the data it’s no longer PII, so there’s no reason to hash it.

Of course, given enough changed data, someone could potentially deduce how the data was changed and thus revert it, at which point it would become PII again and you’d have a problem… but that’s probably a fringe scenario.


I hate to be so self promoting (I swear I'm just trying to be helpful), but Gretel has that as a transformer you can use[0]. You can test out a lot of our stuff without payment info through our console[1] if you just want to mess around and see if tools like it ( and Replibyte of course :) ) would fit your use case. That being said, you can run into issues using direct transforms like this, depending on the correlated data, because of various known deanonymization attacks. There are some pretty gnarly examples out there if you Google around.

[0]https://docs.gretel.ai/gretel.ai/transforms/transforms-model...

[1]https://console.gretel.cloud/login


What you're asking for is similar to what goes by the term "tokenization"[1], a technique often used by payment processors to avoid leaking credit card numbers and similar sensitive data. Using the proper transformer might provide the behavior you need.

1 https://www.tokenex.com/resource-center/what-is-tokenization
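A toy sketch of the idea (real tokenization services keep the vault in a hardened, access-controlled store; this only shows the shape):

  import { randomUUID } from "node:crypto";

  const vault = new Map<string, string>();  // token -> original value (keep this locked down)
  const issued = new Map<string, string>(); // original value -> token (repeats reuse one token)

  // Replace a sensitive value with a random token; only the vault can reverse it.
  export function tokenize(value: string): string {
    const existing = issued.get(value);
    if (existing) return existing;
    const token = randomUUID();
    vault.set(token, value);
    issued.set(value, token);
    return token;
  }

  export function detokenize(token: string): string | undefined {
    return vault.get(token);
  }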


Hi, author of Replibyte here. Feel free to open an issue and explain what your use case is. I will be happy to work out a solution with the community.


I recommend checking out clickhouse-obfuscator. It's a more sophisticated tool for dataset obfuscation.

Installation (single binary Linux/Mac/FreeBSD):

  curl https://clickhouse.com/ | sh

  ./clickhouse obfuscator --help

Docs: https://clickhouse.com/docs/en/operations/utilities/clickhou...


I will take a look at it for Replibyte. Thanks for sharing.


The default seems to be to store the sanitized dump on S3.

S3 isn’t always available in a professional context, or storing a dump there might be considered data extraction.

Keeping everything local and detailing exactly what goes where and how would be helpful.


Also, it would be nice if it were possible to run everything without uploading to S3. As a smaller-time dev with projects in production, I would find this really interesting for debugging production database data, but in development. Uploading it to and keeping it in S3 would needlessly complicate things for me (even though I can understand that enterprise customers might prefer it that way).



I think the description in the man entry is better than the one in the README. Other than that, cool tool!



