Risks of "De-Identified" Therapy Transcripts

Part 1 of 3

De-identification can reduce privacy risk, but it does not make a therapy transcript anonymous. Safe Harbor removes certain identifiers, but many of those identifiers are not the things clients usually say out loud in therapy.


This is Part 1 of 3 in a series about the risks of retaining therapy session transcripts.


We've been seeing more EHR and AI scribe companies talk about retaining therapy session transcripts -- usually so they can train, test, or improve their AI. And when they explain how they're doing it "safely," a word that comes up a lot is "de-identified."

Hearing that your therapy session transcript is de-identified certainly sounds reassuring! But... it does not guarantee that the transcript can never be connected back to the real people involved, both the therapist and the client.

But let's rewind just a bit:

Therapy transcripts are extremely valuable to AI companies. They show how clients actually talk, how therapists actually respond, and how messy real human conversations can be.

That makes transcripts useful for training AI, and for testing and fine-tuning the AI models that power these mental health platforms.

Which is exactly why so many companies want to hold onto them. And it's why "we de-identify the data" keeps showing up in their privacy explanations.

But what does "de-identification" actually mean?

One of the most common HIPAA de-identification approaches is called "Safe Harbor." It requires removing 18 specific categories of identifiers -- things like names, addresses, phone numbers, email addresses, Social Security numbers, medical record numbers, IP addresses, license plate numbers, and so on. (The full list is at the bottom of this post if you want to see it.) The organization must also have no actual knowledge that what's left could be used to identify the person, but in practice the conversation tends to center on the list.

Those identifiers matter, obviously. A therapy transcript should never be stored with someone's name, phone number, address, or medical record number attached.

But here's the mismatch with therapy... And when you realize this too, you'll hit your head against your desk just like we did.

Therapy clients don't usually say their Social Security number out loud.

Look at that Safe Harbor list again. Most of those identifiers are things clients don't say in a therapy session. They're not casually mentioning their license plate or IP address. Heck, they rarely even say their own full name during a session!

Those identifiers make a lot of sense when you're looking to de-identify billing records, claims data, and other structured health records. But they make much less sense for a transcript of a 50-minute conversation.

A session transcript can have all these obvious identifiers removed and it can still include enough context that:

  • The client would absolutely recognize themselves.
  • Someone close to the client might recognize them too.
  • Someone with access to public information might be able to narrow it down.

Imagine a transcript that mentions the client is a teacher in a small school district, their spouse owns a local business, their child was involved in a local news event, and a custody hearing is coming up next month.

None of that is a Social Security number. None of it is a license plate. It would pass a Safe Harbor checklist just fine.
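To make that concrete, here is a minimal Python sketch of a pattern-based scrubber. The `PATTERNS` table and the `scrub` function are hypothetical names invented for this post, the regexes cover only a handful of the 18 categories, and real de-identification tools are far more thorough. The point is only to show how identifier matching behaves on a structured record versus a narrative like the one above.

```python
import re

# Simplified, illustrative patterns for a few Safe Harbor categories.
# These are assumptions for this sketch, not a complete or compliant scrubber.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "IP": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def scrub(text: str) -> tuple[str, int]:
    """Replace each pattern match with a [LABEL] tag.

    Returns the scrubbed text and the number of replacements made.
    """
    total = 0
    for label, pattern in PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        total += count
    return text, total


# A made-up structured record: the kind of data these identifiers fit well.
structured = "Patient phone 555-867-5309, SSN 123-45-6789, email pat@example.com"

# A made-up line in the spirit of the transcript described above.
narrative = (
    "I teach in a small school district, my spouse owns a local business, "
    "our kid was in the local news, and the custody hearing is next month."
)

print(scrub(structured))
print(scrub(narrative))
```

In this sketch the structured record has all three of its identifiers replaced, while the narrative comes back unchanged with zero replacements. Nothing in it looks like an identifier to a pattern matcher, even though a person reading it might recognize who is speaking.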

But isn't that still a transcript that could be traced back to the individual? You don't need to be Sherlock Holmes to connect some of the dots that are mentioned in (what should be) the very private space of a therapy session.
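Here is a toy illustration of how that dot-connecting works. Everything in it is invented: the six records in `people`, the field names, and the `narrow` helper are assumptions for this sketch, not real data or a real re-identification method. It just counts how many candidates remain as each detail from a transcript is applied.

```python
# Hypothetical residents of a small town; every record is made up.
people = [
    {"id": "A", "job": "teacher", "spouse_owns_business": True, "custody_hearing": True},
    {"id": "B", "job": "teacher", "spouse_owns_business": True, "custody_hearing": False},
    {"id": "C", "job": "teacher", "spouse_owns_business": False, "custody_hearing": False},
    {"id": "D", "job": "nurse", "spouse_owns_business": True, "custody_hearing": True},
    {"id": "E", "job": "teacher", "spouse_owns_business": False, "custody_hearing": True},
    {"id": "F", "job": "farmer", "spouse_owns_business": False, "custody_hearing": False},
]

# Details a reader could pick up from the "de-identified" transcript.
details = [
    ("job", "teacher"),
    ("spouse_owns_business", True),
    ("custody_hearing", True),
]


def narrow(population, details):
    """Return how many people still match after each detail is applied in order."""
    remaining = list(population)
    counts = []
    for field, value in details:
        remaining = [person for person in remaining if person[field] == value]
        counts.append(len(remaining))
    return counts


print(narrow(people, details))
```

In this made-up town, three details are enough to go from six people to one, and none of those details is a name or a number.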

De-identification can absolutely reduce risk. It can remove obvious identifiers. It's not pointless.

But with therapy transcripts specifically, the most sensitive information is often the story itself. And the story is exactly what makes the transcript valuable to the company holding it.

Would you be okay if someone got ahold of your own personal "de-identified" therapy session transcripts?

Would you be okay if they were used to train future AI models?


Appendix: The 18 Safe Harbor Identifiers

Under HIPAA's Safe Harbor method, these identifiers of the individual (or of the individual's relatives, employers, or household members) must be removed:

  1. Names
  2. All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes [...]
  3. All elements of dates, except year, for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates, including year, indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older
  4. Telephone numbers
  5. Fax numbers
  6. Email addresses
  7. Social security numbers
  8. Medical record numbers
  9. Health plan beneficiary numbers
  10. Account numbers
  11. Certificate/license numbers
  12. Vehicle identifiers and serial numbers, including license plate numbers
  13. Device identifiers and serial numbers
  14. Web Universal Resource Locators (URLs)
  15. Internet Protocol (IP) addresses
  16. Biometric identifiers, including finger and voice prints
  17. Full-face photographs and any comparable images
  18. Any other unique identifying number, characteristic, or code [...]

(Source for the Safe Harbor identifiers: HHS.gov)

Quill Therapy Solutions
