This is Part 1 of 3 in a series about the risks of retaining therapy session transcripts.
We've been seeing more EHR and AI scribe companies talk about retaining therapy session transcripts -- usually so they can train, test, or improve their AI. And when they explain how they're doing it "safely", one word comes up a lot: "de-identified".
Hearing that your therapy session transcript is de-identified certainly sounds reassuring! But it does not guarantee that the transcript can never be connected back to the real people in the session: the therapist and the client.
But let's rewind just a bit:
Therapy transcripts are extremely valuable to AI companies. They show how clients actually talk, how therapists actually respond, and how messy real human conversations can be.
That makes transcripts useful for training AI, and for testing and fine-tuning the AI models that power these mental health platforms.
That is exactly why so many companies want to hold onto them, and why "we de-identify the data" keeps showing up in their privacy explanations.
But what does "de-identification" actually mean?
HIPAA offers two de-identification methods, Safe Harbor and Expert Determination. "Safe Harbor" is the checklist-style one. It requires removing 18 specific categories of identifiers -- things like names, addresses, phone numbers, email addresses, Social Security numbers, medical record numbers, IP addresses, license plate numbers, and so on. (The full list is at the bottom of this post if you want to see it.)
Those identifiers matter, obviously. A therapy transcript should never be stored with someone's name, phone number, address, or medical record number attached.
But here's the mismatch with therapy... And when you realize this too, you'll hit your head against your desk just like we did.
Therapy clients don't usually say their Social Security number out loud.
Look at that Safe Harbor list again. Most of those identifiers are things clients don't say in a therapy session. They're not casually mentioning their license plate or IP address. Heck, they probably don't even say their own full name very often during a session!
Those identifiers make a lot of sense when you're looking to de-identify billing records, claims data, and other structured health records. But they make much less sense for a transcript of a 50-minute conversation.
A session transcript can have all these obvious identifiers removed and it can still include enough context that:
- The client would absolutely recognize themselves.
- Someone close to the client might recognize them too.
- Someone with access to public information might be able to narrow it down.
Imagine a transcript that mentions the client is a teacher in a small school district, their spouse owns a local business, their child was involved in a local news event, and a custody hearing is coming up next month.
None of that is a Social Security number. None of it is a license plate. As long as no names or specific places are spelled out, it could pass a Safe Harbor checklist just fine.
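To make that concrete, here's a minimal sketch in Python of a pattern-based scrubber run against the scenario above. Everything in it is our own illustration: the `PATTERNS` table, the `scrub` function, and the excerpt are hypothetical, and this is not how any particular vendor works or a complete Safe Harbor implementation (real tools also hunt for names, places, dates, and more). It only covers a few identifier types that have a recognizable shape in text.

```python
import re

# Toy patterns for a few Safe Harbor categories that have a recognizable
# text shape. Illustrative assumption only: not complete, not compliant.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scrub(text):
    """Replace each pattern match with a [LABEL] tag.

    Returns the scrubbed text and a dict of how many hits each pattern had.
    """
    counts = {}
    for label, pattern in PATTERNS.items():
        text, n = pattern.subn(f"[{label.upper()}]", text)
        counts[label] = n
    return text, counts


# Hypothetical transcript excerpt, built from the scenario described above.
excerpt = (
    "I teach in a small school district, my spouse owns a local business, "
    "my kid was in that local news story, and the custody hearing is next month."
)

scrubbed, counts = scrub(excerpt)
print(counts)               # every count is 0
print(scrubbed == excerpt)  # True: nothing was removed
```

Nothing gets flagged, because none of those details has the shape of an identifier. The scrubber did its job, and the story is still sitting there word for word.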
But isn't that still a transcript that could be traced back to the individual? You don't need to be Sherlock Holmes to connect some of the dots that are mentioned in (what should be) the very private space of a therapy session.
De-identification can absolutely reduce risk. It can remove obvious identifiers. It's not pointless.
But with therapy transcripts specifically, the most sensitive information is often the story itself. And the story is exactly what makes the transcript valuable to the company holding it.
Would you be okay if someone got ahold of your own personal "de-identified" therapy session transcripts?
Would you be okay if they were used to train future AI models?
Appendix: The 18 Safe Harbor Identifiers
Under HIPAA's Safe Harbor method, these identifiers of the individual (or of the individual's relatives, employers, or household members) must be removed:
- Names
- All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes [...]
- All elements of dates, except year, for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates, including year, indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
- Telephone numbers
- Fax numbers
- Email addresses
- Social security numbers
- Medical record numbers
- Health plan beneficiary numbers
- Account numbers
- Certificate/license numbers
- Vehicle identifiers and serial numbers, including license plate numbers
- Device identifiers and serial numbers
- Web Universal Resource Locators (URLs)
- Internet Protocol (IP) addresses
- Biometric identifiers, including finger and voice prints
- Full-face photographs and any comparable images
- Any other unique identifying number, characteristic, or code [...]
(Source for the Safe Harbor identifiers: HHS.gov)