Anonymized Database Dump
The anonymize_dump make target produces a sanitized copy of a CARE database dump that is safe to share with collaborators or to use in research publications. The original database is never modified.
The dump contains all data belonging to users who gave consent (acceptDataSharing = true), with their personal information replaced by realistic pseudonyms. All data belonging to non-consenting users is removed. Auth secrets are wiped on every account so the dump cannot be used to log in.
Usage
make anonymize_dump \
CONTAINER=<container-name-or-id> \
DUMP=<filename-inside-db_dumps>
The output file is written to db_dumps/anonymized_<timestamp>.sql. The admin email is read from ADMIN_EMAIL in your .env file. You will be prompted to enter a new admin password interactively during the process.
Optional parameters:
SEED=<integer> — seed the Faker RNG so pseudonyms are reproducible across runs.
NUM=<integer> — after removing non-consenting users, randomly retain only NUM consenting users and remove the rest. Combine with SEED for a deterministic subset.
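The interaction of SEED and NUM can be illustrated with a small seeded sampler. This is a sketch only — the real pipeline uses its own RNG and operates on database rows; the user IDs and the mulberry32 generator below are illustrative, not the script's actual code:

```javascript
// Tiny deterministic PRNG (mulberry32) so the "random" subset is
// reproducible for a given seed, mirroring the SEED parameter.
function mulberry32(seed) {
  return function () {
    seed |= 0; seed = (seed + 0x6D2B79F5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Retain NUM users out of the consenting set, chosen pseudo-randomly.
function retainSubset(consentingUserIds, num, seed) {
  const rand = mulberry32(seed);
  const pool = [...consentingUserIds];
  // Fisher-Yates shuffle driven by the seeded RNG.
  for (let i = pool.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, num).sort((a, b) => a - b);
}

const ids = [1, 2, 3, 5, 8, 13, 21, 34];
const first = retainSubset(ids, 3, 42);
const second = retainSubset(ids, 3, 42);
console.log(JSON.stringify(first) === JSON.stringify(second)); // same seed -> same subset
```

Without SEED, each run would pick a different subset; with it, collaborators regenerating the dump get the same retained users.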
Pipeline steps
The target runs the following steps against a temporary sidecar database so the live care database is never touched.
Create sidecar DB — a fresh database is created inside the container.
Restore dump — the source dump is piped into the sidecar database.
Migrate schema — any pending Sequelize migrations are applied.
Anonymize —
Phase A — hard-delete all rows belonging to non-consenting users across every affected table, then reassign ownership of shared resources (projects, studies, documents, …) to the admin account.
Phase A.2 — safety net: any userId integer column not covered by a Sequelize association is cleaned up automatically and a warning is logged.
Phase B — replace PII on surviving users with Faker-generated values. Columns are read from User.accessMap so future PII additions are picked up automatically.
Phase C — set all auth-secret columns (passwordHash, salt, resetToken, …) to 'ANONYMIZED' on every account.
Reset admin password — the admin password is set interactively. The admin’s email and data are preserved throughout.
Export — pg_dump writes the sidecar database to db_dumps/anonymized_<timestamp>.sql. Then all document files referenced by surviving records are collected from the local files/ directory and zipped into db_dumps/anonymized_<timestamp>_files.zip.
Drop sidecar DB — the temporary database is removed.
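The anonymization phases can be sketched in miniature on plain objects. This is illustrative only — the real script runs SQL against the sidecar database through Sequelize, the column set here is simplified, and the Phase A.2 safety net is omitted:

```javascript
// Toy in-memory stand-ins for the users table and a shared-resource table.
const ADMIN_ID = 1;
let users = [
  { id: 1, userName: 'admin', email: 'admin@example.org', acceptDataSharing: true,  passwordHash: 'h1', salt: 's1' },
  { id: 2, userName: 'alice', email: 'alice@example.org', acceptDataSharing: true,  passwordHash: 'h2', salt: 's2' },
  { id: 3, userName: 'bob',   email: 'bob@example.org',   acceptDataSharing: false, passwordHash: 'h3', salt: 's3' },
];
let documents = [
  { id: 10, ownerId: 2 },
  { id: 11, ownerId: 3 }, // owned by a non-consenting user
];

// Phase A: hard-delete non-consenting users, reassign their shared resources to admin.
const removed = new Set(users.filter(u => !u.acceptDataSharing).map(u => u.id));
users = users.filter(u => u.acceptDataSharing);
documents.forEach(d => { if (removed.has(d.ownerId)) d.ownerId = ADMIN_ID; });

// Phase B: replace PII on surviving users (Faker-generated values in the real pipeline).
users.forEach((u, i) => {
  if (u.id !== ADMIN_ID) {
    u.userName = `user_${i}`; // stand-in for a Faker animal username
    u.email = `user_${i}@example.invalid`;
  }
});

// Phase C: wipe auth secrets on every account, admin included.
users.forEach(u => { u.passwordHash = 'ANONYMIZED'; u.salt = 'ANONYMIZED'; });
```

Note that the admin account keeps its identity (Phase B skips it) but still loses its secrets in Phase C, which is why the interactive password reset step follows.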
Loading the dump
The target produces two output files in db_dumps/:
anonymized_<timestamp>.sql — the anonymized database dump.
anonymized_<timestamp>_files.zip — the document files referenced by surviving records.
To load the dump into an existing CARE instance, place the SQL file in db_dumps/ and use the standard recovery command:
make recover_db CONTAINER=<container-name-or-id> DUMP=anonymized_<timestamp>.sql
If you need to regenerate the files archive from an existing anonymized SQL dump without re-running the full pipeline, use:
make export_dump_files CONTAINER=<container-name-or-id> DUMP=anonymized_<timestamp>.sql
Then extract the files archive into the files/ directory of the target instance:
unzip -o db_dumps/anonymized_<timestamp>_files.zip -d files/
Warning
recover_db will overwrite the current database state of the target container.
What is kept
acceptDataSharing and acceptedAt — preserved as proof of consent.
userName — replaced with a new randomly generated animal username.
Study structure: workflows, steps, tag sets, templates, configurations.
Consenting users’ annotations, comments, sessions, and assignments.
Limitations
Warning
Free-text fields (annotation text, comment text, document content) may still contain personally identifiable information typed in by participants. A manual review of free-text content is recommended before sharing the dump externally.
The NUM subset reduction picks users at random and does not guarantee that every retained user has a minimum number of annotations or sessions.
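As a starting point for the recommended manual review, a crude first pass can flag obvious identifiers in free-text fields. The patterns below are illustrative and will not catch all PII — a match is a reason to inspect a record, not proof of PII, and a clean scan is not proof of absence:

```javascript
// Flag email addresses and phone-number-like sequences in free text.
const PATTERNS = [
  { name: 'email', re: /[\w.+-]+@[\w-]+\.[\w.]+/g },
  { name: 'phone', re: /\+?\d[\d\s()-]{7,}\d/g },
];

function flagFreeText(rows) {
  const hits = [];
  for (const row of rows) {
    for (const { name, re } of PATTERNS) {
      for (const match of row.text.matchAll(re)) {
        hits.push({ id: row.id, kind: name, match: match[0] });
      }
    }
  }
  return hits;
}

const sample = [
  { id: 1, text: 'Great study, contact me at jane.doe@example.com' },
  { id: 2, text: 'No identifiers here.' },
];
console.log(flagFreeText(sample)); // flags the email in record 1
```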
Keeping the pipeline up to date
The anonymize script uses Sequelize associations to locate every row belonging to a given user. When you add a migration that introduces a new table or column referencing a user, the corresponding static associate() block must be added or updated so the pipeline can reach the new rows.
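The shape of such an associate() block can be sketched with a stub in place of real Sequelize models (illustrative only — not actual CARE code; ReviewNote is a hypothetical new table, and in real code belongsTo/hasMany come from sequelize.Model):

```javascript
// Stub "model" that records associations, standing in for a Sequelize model.
function makeModel(name) {
  return {
    name,
    associations: [],
    belongsTo(target, opts) { this.associations.push({ type: 'belongsTo', target: target.name, ...opts }); },
    hasMany(target, opts) { this.associations.push({ type: 'hasMany', target: target.name, ...opts }); },
  };
}

const User = makeModel('User');
const ReviewNote = makeModel('ReviewNote'); // hypothetical table added by a migration

// The associate() block the migration must be paired with, so the
// anonymize pipeline can traverse from User to the new rows.
function associate(models) {
  models.ReviewNote.belongsTo(models.User, { foreignKey: 'userId' });
  models.User.hasMany(models.ReviewNote, { foreignKey: 'userId' });
}

associate({ User, ReviewNote });
console.log(ReviewNote.associations[0].foreignKey); // 'userId'
```

Without the association, the new userId column would only be caught by the Phase A.2 safety net, which logs a warning instead of traversing the relation properly.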