Some mental health EHR and AI Scribe companies are retaining therapy session transcripts. These transcripts can then be used to train and improve their AI functionality. And users (therapists and clients alike) are reassured with the idea that these transcripts will be “de-identified”.
What does “de-identified” mean?
Most people hear “de-identified” and assume it means there is no practical way to trace the data back to the therapist and client.
But in the real world, what’s actually happening?
A common de-identification method is HIPAA’s Safe Harbor. This method requires that 18 categories of identifiers be removed from the data. These identifiers include names, Social Security numbers, phone numbers, fax numbers, license plate numbers, etc. (For the full list, you can read more about this at HHS.gov.)
At first glance, this seems so reasonable! And if you’re working with a standard healthcare-related form, perhaps it is! Imagine those forms you first fill out when you sit down at a doctor’s appointment. If all of those identifiers were swapped out, no one could reasonably figure out that it was actually you who had filled out that form. It’s a bunch of granular data, and the identifying parts have been removed.
But what about a therapy session transcript?
How often does a client mention their license plate number? Heck, how often does their full name even come up in conversation? How often do any of these identifiers get verbally mentioned in a session, leading to them getting recorded and transcribed into a document afterwards?
And so you can apply “de-identification” to a therapy session transcript and be left with something that is still very similar to the original. Is that okay? It doesn’t have any obvious identifiers, so maybe?
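To make that concrete, here is a minimal sketch in Python. It is not a real Safe Harbor implementation: the three patterns, the `scrub` function, the sample form, and the sample transcript are all invented for illustration, and real identifier removal covers far more than a few regular expressions. It only shows the shape of the problem: a form loses its identifying parts, while a narrative transcript passes through untouched.

```python
import re

# Hypothetical patterns for a few direct identifiers. The real Safe Harbor
# list is longer and cannot be reduced to a handful of regular expressions.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def scrub(text: str) -> str:
    """Replace each matched identifier with a bracketed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


# Invented sample data: a form-style record and a narrative transcript line.
intake_form = "Phone: 555-867-5309. SSN: 123-45-6789. Email: pat@example.com."
transcript = (
    "I'm the only large-animal vet in town, and since the accident on the "
    "bridge made the news, I keep dreading the custody hearing next month."
)

scrubbed_form = scrub(intake_form)
scrubbed_transcript = scrub(transcript)

print(scrubbed_form)
print(scrubbed_transcript == transcript)
```

The form comes back as placeholders. The transcript comes back exactly as it went in, because nothing in it looks like a listed identifier.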
But wait… What about the other details that were mentioned by the client or therapist during the session? Maybe in isolation, they don’t identify anyone, but what if the combination of two or three could identify someone? Maybe a unique job that’s mentioned, plus a local event, plus an upcoming legal proceeding. Maybe it’s an accident that they were involved with that was reported in the local news. Or maybe a recent real estate transaction or health procedure. None of these are the obvious identifiers people usually think of, but they can still become identifying.
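Here is a rough sketch of how a combination of ordinary details can narrow things down. The population below is entirely made up, and the `matches` helper is hypothetical. The point is only that each detail alone fits several people, while all three together fit one.

```python
# Invented toy population. None of these fields is a HIPAA-style identifier.
people = [
    {"id": 1, "job": "teacher", "event": "harvest fair", "court_date": False},
    {"id": 2, "job": "teacher", "event": "harvest fair", "court_date": True},
    {"id": 3, "job": "farrier", "event": "harvest fair", "court_date": True},
    {"id": 4, "job": "farrier", "event": "marathon", "court_date": False},
    {"id": 5, "job": "nurse", "event": "harvest fair", "court_date": True},
    {"id": 6, "job": "nurse", "event": "marathon", "court_date": True},
    {"id": 7, "job": "teacher", "event": "marathon", "court_date": False},
    {"id": 8, "job": "farrier", "event": "harvest fair", "court_date": False},
]


def matches(population, **criteria):
    """Return everyone whose record agrees with every given detail."""
    return [
        person for person in population
        if all(person[key] == value for key, value in criteria.items())
    ]


by_job = matches(people, job="farrier")
by_event = matches(people, event="harvest fair")
by_court = matches(people, court_date=True)
combined = matches(people, job="farrier", event="harvest fair", court_date=True)

print(len(by_job), len(by_event), len(by_court), len(combined))
```

In this toy data, the job alone fits three people, the event fits five, and the court date fits four. Put together, only one person is left.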
And so that’s where we’re at. “De-identifying” health documents is super important, and it makes a ton of sense for many, many things. But does it really work for therapy session transcripts? These other details are hard to reliably detect in narrative transcripts without gutting the usefulness of the data, which is the reason the company wants to retain it in the first place.
We already store progress notes about the session at my EHR, so what’s the big deal?
A progress note is intentionally limited. The therapist is trained to include only the important and relevant details, so that the note does no harm to the client if it is subpoenaed, read aloud in court, or requested and read by the client. “Less is more”, as they say. Meanwhile, a transcript is raw and full of details and direct quotes that were never meant to become a permanent record.
But wait! The mental health tech company said that they’d also “de-couple” this transcript data!
Honestly, the conversation so far has already assumed that the data was “de-coupled”. It would be pretty silly to de-identify this data and not de-couple it.
What does “de-couple” mean?
It means that these retained transcripts do not have a direct link to the original session or client or therapist or other record in the database. Each transcript would exist in isolation. If it weren’t de-coupled, you’d be one database query away from knowing which therapy session conversation the transcript came from. De-coupling is the bare minimum, which is why emphasizing it is a little silly.
So let’s talk about consent.
Presumably, both the client and the therapist explicitly consented to having their therapy session transcripts retained. And remember: Consent only means something if therapists and clients understand the whole chain: that transcripts may be retained, used to train or improve AI, de-identified in limited ways, de-coupled in ways that may limit deletion, and potentially copied into backups, evaluation sets, model workflows, or downstream systems.
But what if they change their mind? What if they say, “Wait, please delete all of those transcripts that were retained”?
Oops! Because these transcripts are “de-coupled”, how could they be traced back to the client or therapist who is requesting that they be deleted? If the transcripts can actually be identified, were they truly de-coupled in the first place? And if the transcripts cannot be identified, does that mean that a client and therapist completely lose the ability to request that their data be deleted in the future? This is quite a predicament!
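A small sketch of that predicament, under heavy assumptions: the `sessions` table, the `retained` store, and both functions are hypothetical, and no real EHR is claimed to work this way. The retained copy gets a fresh random ID and keeps no pointer back to the session, so a deletion request has nothing to match on.

```python
import uuid

# Hypothetical primary records: session ID -> who the session belongs to.
sessions = {
    "s-1": {"client": "client-A"},
    "s-2": {"client": "client-B"},
}

# Hypothetical retained store for "de-coupled" transcript copies.
retained = []


def retain_decoupled(session_id: str, text: str) -> None:
    """Keep a copy under a random ID; the session ID is deliberately dropped."""
    retained.append({"transcript_id": str(uuid.uuid4()), "text": text})


def delete_for_client(client: str) -> int:
    """Try to delete a client's retained transcripts; return how many were removed."""
    session_ids = {sid for sid, s in sessions.items() if s["client"] == client}
    before = len(retained)
    # Retained records carry no "session_id" field, so nothing ever matches.
    retained[:] = [r for r in retained if r.get("session_id") not in session_ids]
    return before - len(retained)


retain_decoupled("s-1", "transcript of client-A's session")
retain_decoupled("s-2", "transcript of client-B's session")

removed = delete_for_client("client-A")
print(removed, len(retained))
```

The request removes nothing, and both transcripts are still sitting there. The only way to honor the request would be to keep some link, which means the data was not fully de-coupled after all.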
Even if the transcript files or records can be deleted, what if they were used to train the AI model? Training data is not just stored data. Once transcripts are used for training, fine-tuning, evaluation, or improvement of an AI model, deletion gets much murkier. Even if the original file is deleted, it may be hard to know what was learned, extracted, or incorporated elsewhere.
And who actually gets access to these transcripts? If transcripts are retained for AI functionality (specifically, training and improving AI models), are they shared with model providers, analytics tools, storage vendors, QA systems, labeling vendors, or other downstream vendors? Are they handled with the same care as PHI data?
But what’s the big deal? Who cares about my transcripts in the first place? The company says they’re HIPAA compliant, so I should be good, right?
“HIPAA compliant” does not mean that the security of your data is guaranteed. Hundreds of data breaches happen to HIPAA-compliant companies every year. In fact, HIPAA itself spells out what to do in the event of a data breach, including the responsibility to notify the impacted users.
What happens if there’s a data breach with these “de-identified” and “de-coupled” transcripts? Will users be notified? Will all users of the service have to be notified, since the originating users can’t be located due to de-coupling? Does the fact that these transcripts are de-identified and de-coupled remove the tech company’s responsibility to notify users of the data breach? Would the company argue the breached data was no longer PHI? These are all questions worth asking the tech companies who are retaining the transcripts.
Data breaches happen. They can happen because of malicious actors (hackers with Cheeto dust on their faces like you see in the movies) or, probably more likely, because of an innocent accident or oversight by somebody within the organization (or a third-party service that they use). Even with the best procedures in place, accidents happen. Over a long enough timeline, breaches and mistakes are not hypothetical. This is why it’s so important to consider where you send your data and how that data is handled, retained, etc. The less data, the better!
So when there is inevitably a data breach with one of these companies that are storing these transcripts, how bad is it? Honestly, it really comes down to how much data was breached. Was it just the transcripts? Was it additional data too? Partial data sets of both? We can’t predict the future!
In the event of a data breach, access to this data could lead to “re-identification”…
What is “re-identification”?
Basically, it’s the reverse of “de-identification”. You could technically take this data that doesn’t have direct identifiers and tie it back to the individuals involved, using other elements in the data set to figure it out.
Here are some basic ways this could happen:
- Triangulation of other specific details: Use other details and references in the transcript, as mentioned previously: current events, unique job references, a recent death that was mentioned, etc. There are details on the internet that, when combined with the transcript data, could help triangulate who said it. In isolation, these details are not “identifiers”, but when combined, especially with outside data, they can possibly lead to identifying the individuals.
- Speech patterns: Tie multiple transcripts to the same therapist and/or client through the use of speech patterns or other details. And then combine details across multiple sessions to get enough details to triangulate and identify the person. Think of how many sessions a therapist has each week, and the large transcript data set that would then exist — surely there are some speech patterns and frequently-used phrases that are somewhat specific to that therapist.
- Timestamps and other metadata: Records in databases are typically stored with the date and time that they were created — also called a “timestamp”. Even if these transcripts are “de-coupled” from the original database records, do they have timestamps? Could these be tied back to the original session, using the session’s start and end times? Was there any other secondary metadata that just happened to be stored on both sets of records, allowing them to be analyzed and ultimately linked together, even if the link is not immediately obvious?
- Progress notes: If progress notes are also involved in the data breach, think of the possibilities. Progress notes have just a small summarized sliver of the details that are mentioned in a therapy session. But couldn’t they be aligned with the session transcripts, considering there’s a one-to-one relationship between transcript and progress note, where the former results in the latter? In an EHR, a progress note then has a direct link back to the client and therapist, so if a data breach involves these sets of data, this is not that far-fetched.
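The timestamp scenario from the list above can be sketched in a few lines of Python. Everything here is an assumption for illustration: the field names (`ended_at`, `created_at`), the ten-minute window, and the sample records are invented, and a real system may store different metadata or none at all. The sketch assumes a transcript record is created shortly after its session ends.

```python
from datetime import datetime, timedelta

# Invented session records that still link to a client.
sessions = [
    {"session_id": "s-1", "client": "client-A", "ended_at": datetime(2024, 3, 4, 10, 50)},
    {"session_id": "s-2", "client": "client-B", "ended_at": datetime(2024, 3, 4, 13, 50)},
    {"session_id": "s-3", "client": "client-C", "ended_at": datetime(2024, 3, 5, 9, 50)},
]

# Invented "de-coupled" transcripts: no session ID, but a creation timestamp.
transcripts = [
    {"transcript_id": "t-9f2", "created_at": datetime(2024, 3, 4, 13, 52)},
    {"transcript_id": "t-1c7", "created_at": datetime(2024, 3, 5, 9, 51)},
    {"transcript_id": "t-44a", "created_at": datetime(2024, 3, 4, 10, 53)},
]


def link_by_time(transcripts, sessions, window=timedelta(minutes=10)):
    """Map each transcript to the sessions that ended shortly before it was created."""
    links = {}
    for t in transcripts:
        links[t["transcript_id"]] = [
            s["session_id"]
            for s in sessions
            if timedelta(0) <= t["created_at"] - s["ended_at"] <= window
        ]
    return links


links = link_by_time(transcripts, sessions)
print(links)
```

In this toy data, every transcript lines up with exactly one session. A busy practice with overlapping sessions would produce several candidates per transcript, but that only narrows the search. Other shared metadata or the transcript content itself could finish the job.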
But who would bother doing all of this complicated work? Oh, humans are not doing this. Computers are. Software is. More specifically, AI is. AI is super good at identifying patterns and tying bits of data together. And it’s relatively cheap — and can work 24/7, 365 days a year. And software is great at working with large data sets just like this. There’s no need to hire the next Sherlock Holmes. Or 100 Sherlock Holmes. You just need access to a few computers.
Okay, so say there’s a data breach. And say someone gets ahold of all of my therapy session transcripts. Who cares?
Maybe you live your life as an open book! Everything is out there, and there’s nothing private ever shared in your therapy sessions — or in your transcripts. If so, that’s great. Congrats. But that’s probably not the case for most.
Maybe someone shared doubts about their marriage. Frustrations with their kids or parents. Concerns about their immigration status. Decisions regarding women’s health. Questions about their sexuality. Complaints about their boss and coworkers. Confessions of past crimes. These sorts of details would typically have been vaguely referenced in a progress note — with the finer details and the client’s exact phrasing left in the session, only to be heard by the therapist. With the advent of some companies storing therapy session transcripts, now they have the potential to be read and analyzed by others.
None of these details should be embarrassing. Folks should be free to share any and all of these details in therapy! But if they got into the wrong hands, couldn’t they be abused? Government misuse, legal action, marital and custody disputes, subpoenas, or blackmail… Once this data is out there, you can’t rein it back in.
So we have to think about the future too.
Tech companies can make promises today and change their terms tomorrow. They can get acquired or go bankrupt tomorrow. The reassurances made today about this data do not guarantee its safety and responsible handling tomorrow.
Technology is only getting better. AI is only getting better. Working through large data sets like this would have required a lot of effort just a few years ago. That’s no longer the case now. And now AI is writing a lot of code — and it’s also identifying a lot of security gaps in software too. It’s a double-edged sword. AI is equally productive at working on “good” things and working on “bad” things… It’s just a matter of who is operating it.
If you have a data set of session transcripts that is declared “de-identified” and “de-coupled”, and even with the most intense of efforts, it can’t be traced back to the original sources today… Will that be true tomorrow? Or a year from now?
When there is a data breach at one of these tech companies, it will haunt the clients for years to come. Which is why we must be cautious about what data we choose to create in the first place.
So what do you propose we do?
If you’re a therapist or a therapy client or a mental health EHR or AI Scribe, please do not store therapy session transcripts, de-identified or otherwise. It’s not worth sacrificing the current and future privacy of the individuals involved.
Therapy depends on clients knowing that the room is private.