Wednesday, August 17, 2016

I Hope Some Serious Experts Have Examined These Plans Carefully. It Is Important They Do!

This appeared a little while ago:

Linkable de-identified 10% sample of Medicare Benefits Schedule (MBS) and Pharmaceutical Benefits Schedule (PBS)

This data is a collection of the current and historical use of Medicare and PBS services. This data release contains approximately 1 billion lines of data relating to approximately 3 million Australians. The data sets have been designed to enable other datasets to be linked in the future, for example hospital data, immunisation data. The addition of these data sets will greatly increase the amount of data and open new areas of analysis.
A suite of confidentiality measures including encryption, perturbation and exclusion of rare events has been applied to safeguard personal health information and ensure that patients and providers cannot be re-identified.

Confidentialisation Methodology

All Medicare and PBS claims for a random 10% sample of patients are included in the release. To be clear, it is a 10% sample of patients, not a 10% sample of Medicare or PBS claiming activity for the selected patients. Although the data held by the Department does not contain identifiers such as individual patient names, a number of steps have been taken to further protect the confidentiality of the released data.

ID number encryption

  • Patient ID Numbers (PIN) are encrypted using the original PIN as the seed.
  • Provider ID numbers are encrypted using the original ID number as the seed.

Data adjustments

  • Only the patient’s year of birth is given, not the date of birth.
  • Date of service and date of supply are randomly perturbed to ±14 days of the true date.
  • Geographic aggregation:
    Provider State is derived by the Department of Health by mapping the provider's postcode to State. The states are then collapsed to ACT and NSW, Victoria and Tasmania, NT and SA, QLD, and WA. This is not the Servicing Provider State which is supplied from the Department of Human Services.
  • Rare event exclusion: Medicare and PBS items with extremely low service volumes have been removed.
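To make the listed adjustments concrete, here is a minimal Python sketch of how they might be applied to a list of claims. It is an illustration only, not the Department's actual procedure: the field names, the function name `confidentialise` and the `min_item_volume` threshold are all assumptions, since the release notes do not state how "extremely low" is defined or how the perturbation is drawn.

```python
import random
from collections import Counter
from datetime import date, timedelta

# State groupings as described in the release notes.
STATE_GROUPS = {
    "ACT": "ACT+NSW", "NSW": "ACT+NSW",
    "VIC": "VIC+TAS", "TAS": "VIC+TAS",
    "NT": "NT+SA", "SA": "NT+SA",
    "QLD": "QLD", "WA": "WA",
}

def confidentialise(claims, min_item_volume=2, rng=None):
    """Apply the four adjustments to a list of claim dicts.

    min_item_volume is a hypothetical threshold; the real cut-off
    for "extremely low service volumes" is not published.
    """
    rng = rng or random.Random()
    volumes = Counter(c["item"] for c in claims)
    released = []
    for c in claims:
        # Rare event exclusion: drop items claimed too infrequently.
        if volumes[c["item"]] < min_item_volume:
            continue
        released.append({
            "item": c["item"],
            # Only the year of birth is kept.
            "year_of_birth": c["date_of_birth"].year,
            # Shift the service date by a random offset within +/-14 days.
            "date_of_service": c["date_of_service"]
            + timedelta(days=rng.randint(-14, 14)),
            # Collapse the provider's state into its published group.
            "provider_region": STATE_GROUPS[c["provider_state"]],
        })
    return released
```

Note that a uniform integer offset is only one possible reading of "randomly perturbed"; the actual distribution used is not described.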
Here is the page:
There has been a lot of discussion about the risk of individuals being re-identified when access is provided to data sets like this.
Here is a useful link which discusses some of the issues:

The Debate Over ‘Re-Identification’ Of Health Information: What Do We Risk?

August 10, 2012
Dateline: May 18, 1996 – The collapse and attack. Massachusetts Governor William Weld wasn’t feeling well under his commencement cap and gown. He was about to receive an honorary doctorate from Bentley College and give their keynote graduation address. But, unbeknownst to him, he would instead make a critical contribution to the privacy of our health information. As he stepped forward to the podium, it wasn’t what Weld said that now protects your health privacy, but rather what he did: He teetered and collapsed unconscious before a shocked audience.
Weld recovered quickly and the incident might have passed quietly but for an MIT graduate student. Latanya Sweeney’s studies had brought to her attention hospital data released to researchers by the Massachusetts Group Insurance Commission (GIC) for the purpose of improving healthcare and controlling costs. Federal Trade Commission Senior Privacy Adviser Paul Ohm provides a gripping account of Sweeney’s now famous re-identification of Weld’s hospitalization data using voter list information in his 2010 paper “Broken Promises of Privacy.”
It would be difficult to overstate the influence of the Weld voter list attack on health privacy policy in the United States – it had a direct impact on the development of the de-identification provisions in the HIPAA Privacy rule. However, careful examination of the demographics in Cambridge, MA at the time of the re-identification attempt indicates that Weld was most likely re-identifiable only because he was a public figure who experienced a highly publicized hospitalization rather than there being any actual certainty about the accuracy of his attempted re-identification using the Cambridge voter data.
The Cambridge population was nearly 100,000 and the voter list contained only 54,000 of these residents, so the voter linkage could not provide sufficient evidence to allege any definitive re-identification. Because the logic underlying re-identification depends critically on being able to demonstrate that a person within a health data set is the only person in the larger population who has a set of combined “quasi-identifier” characteristics that could potentially re-identify them, re-identification attempts face a strong challenge in being able to create a complete and accurate population register. Furthermore, the same methodological flaws that undermined the certainty of the Weld re-identification continue to create far-reaching systemic challenges for all re-identification attempts – a fact which must be understood by public policy-makers seeking to realistically assess current privacy risks posed by HIPAA de-identified data. (The full details of these technical issues for re-identification risk assessment are available in a more lengthy review.)
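The uniqueness argument above can be sketched in a few lines of Python. The example is hypothetical throughout (the records, field names and helper functions are invented for illustration): it shows how a person can look unique on a set of quasi-identifiers in an incomplete register, such as a voter list, while not being unique in the full population.

```python
from collections import Counter

def class_sizes(records, quasi_identifiers):
    """Count how many records share each quasi-identifier combination."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def unique_records(records, quasi_identifiers):
    """Return the records whose quasi-identifier combination occurs once."""
    sizes = class_sizes(records, quasi_identifiers)
    return [
        r for r in records
        if sizes[tuple(r[q] for q in quasi_identifiers)] == 1
    ]

# Hypothetical full population: A and B share the same quasi-identifiers.
population = [
    {"name": "A", "yob": 1945, "sex": "M", "postcode": "1111"},
    {"name": "B", "yob": 1945, "sex": "M", "postcode": "1111"},
    {"name": "C", "yob": 1980, "sex": "F", "postcode": "2222"},
]
# Hypothetical incomplete register (B is not on it).
register = [population[0], population[2]]

qi = ("yob", "sex", "postcode")
unique_in_register = [r["name"] for r in unique_records(register, qi)]
unique_in_population = [r["name"] for r in unique_records(population, qi)]
```

Here A appears unique in the register but is not unique in the population, so a match on A's quasi-identifiers could equally be B. That is the gap between an apparent and a definitive re-identification that the article describes.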
With the benefit of hindsight, it is apparent that the Weld/Cambridge re-identification has served as an important illustration of privacy risks that were not adequately controlled prior to the 2003 HIPAA Privacy Rule. Still, a broader policy debate continues to rage between some voices, like Ohm, alleging that computer scientists can re-identify individuals hidden in anonymized data with “astonishing ease,” and others who view de-identified data as an essential foundation for a host of envisioned advances under healthcare reform.
Nowhere is this tension more evident within the health policy arena than in the recent proposal by the Office of the National Coordinator for Health Information Technology (ONC) for standards, services, and policies enabling secure health information exchange over the Internet to support the Nationwide Health Information Network (NwHIN). Motivated by concern that perceived re-identification risks could “undermine trust”, ONC proposes that de-identified health information could not be used or disclosed for any commercial purpose, a policy which would be certain to unleash a Pandora’s box of unintended consequences. Yet ONC also broadcasts their skepticism regarding purported re-identification risks by noting that they have been “somewhat exaggerated”.
Because a vast array of healthcare improvements and medical research critically depend on de-identified health information, the essential public policy challenge then is to accurately assess the current state of privacy protections for de-identified data, and properly balance both risks and benefits to maximum effect.
Lots more here:
The steps taken to protect individual identities seem reasonable to the non-expert, but with such a large data set one wonders just what statistical tricks might make it possible to re-identify some of the data.
I would like to hear some expert views on what risk(s) are being run here.


Bernard Robertson-Dunn said...

IMHO, there is a big difference between data de-identification alone and de-identification and obfuscation together.

The mega dataset released by ABS has had de-identification and obfuscation applied.

De-identification is a case of removing personal identifying data from an individual record and leaving the rest intact.

Obfuscation changes the data, or "adjusts" it, as they say.

In the case of the recent data release, ABS has obfuscated/adjusted the data by:

* Only the patient’s year of birth is given, not the date of birth

* Date of service and date of supply are randomly perturbed to ±14 days of the true date.

* Geographic aggregation: Provider State is derived by the Department of Health by mapping the provider's postcode to State. The states are then collapsed to ACT and NSW, Victoria and Tasmania, NT and SA, QLD, and WA.

In addition (this isn't data adjustment, it's data removal):

Rare event exclusion: Medicare and PBS items with extremely low service volumes have been removed.

All of which means that there will be multiple identical (apart from maybe patient ID) records with the same data in them. A single patient may have multiple records with exactly the same data. Similarly, a patient may have multiple identical original records which have had their dates perturbed to make them different.

IMHO, the risk of identifying an individual person from that data is vanishingly small, even if other data sources are available.

Anyone wanting to find out about a specific individual would have much easier options. Bribing a Centrelink/MyHR call centre operator springs to mind, or their GP's medical practice nurse. The human engineering approaches are far more productive.

Peter said...

There are some very clever techniques to de-identify data if you have an expert involved. I recall hearing several years ago that it is possible to identify nearly 50% of people just through knowing the *postcode* where they work and the postcode where they live. Of course this relies on a secondary source of information to match the record against, so is only useful if you are looking for a specific individual.
There is a whole discipline in business intelligence and data security which focuses on PII (personally identifiable information) and what fields, individually or in combination, can be used to uniquely identify a record. Bruce Schneier has written several books on the topic.
Note that matching records when you want to, or need to, can also be difficult. Companies with more than one customer database spend lots of money to link records in one with data in another. De-duplication is not an easy problem.

On the topic of social engineering to find a record: I recall 15 years ago when the privacy legislation was introduced. We were working on Telstra's call centre app at the time and needed to add audit tracking for every screen seen by the staff which may have private information on it. This did not prevent them looking since they needed to do so as part of their work. Instead it recorded what they had searched and seen so that unusual patterns could be identified later, and any customer concerns could be investigated. In other words, cure rather than prevention made more sense in the context. Today the pattern analysis could be automated and a flag raised if e.g. a specialist searched for a famous person who did not attend their clinic.
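As a rough illustration of the automated flagging described above, here is a minimal Python sketch. Everything in it is hypothetical: the log format, the `care_relationships` set and the `watchlist` of high-profile patients are assumptions for the example, not a description of any real system.

```python
def flag_suspicious_lookups(audit_log, care_relationships, watchlist=frozenset()):
    """Flag record views where the staff member has no known reason to look.

    audit_log: iterable of dicts with "staff_id" and "patient_id" keys.
    care_relationships: set of (staff_id, patient_id) pairs, e.g. built
        from appointment bookings (an assumed data source).
    watchlist: patient ids treated as high-profile (also an assumption).
    """
    flagged = []
    for entry in audit_log:
        pair = (entry["staff_id"], entry["patient_id"])
        if pair in care_relationships:
            continue  # Viewing a record of someone they actually treat.
        reason = "no care relationship"
        if entry["patient_id"] in watchlist:
            reason = "high-profile patient, no care relationship"
        flagged.append({**entry, "reason": reason})
    return flagged

log = [
    {"staff_id": "dr_smith", "patient_id": "p1"},
    {"staff_id": "dr_smith", "patient_id": "celebrity"},
]
alerts = flag_suspicious_lookups(
    log,
    care_relationships={("dr_smith", "p1")},
    watchlist={"celebrity"},
)
```

As with the audit tracking itself, this does not stop anyone looking; it only surfaces lookups that deserve a follow-up question.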

Bernard Robertson-Dunn said...

Peter, you said "There are some very clever techniques to de-identify data if you have an expert involved." Did you mean re-identify?

Some of the downsides of de-identification, e.g. it can stuff up research:

NCVHS Hearing: De-identification and HIPAA
May 24, 2016
Improving HIPAA De-identification Public Policy
Daniel C. Barth-Jones, M.P.H., Ph.D.
Assistant Professor of Clinical Epidemiology,
Mailman School of Public Health
Columbia University

Peter said...

Oops, yes - "re-identify".

Agreed - many of our clients using business intelligence and data analytics are looking at customer behaviour. One key factor is segmentation which is based, one way or another, on demographics. But demographic information is exactly the part that makes re-identification easiest, and so it is often kept out of anonymised data sets.