12 October 2023

Hashed PII is PII


In summary

Digital advertising platforms are increasingly encouraging their advertisers to send first party data on the basis that they will be able to better measure and optimise campaigns.
These platforms state that hashed data sent by advertisers is secure and cannot be reversed, but this is not necessarily true.
Make an informed decision on whether or not sharing your customers’ data with these platforms is right for your business.

Transitioning to first party data

By now most digital marketers will be aware of significant changes in the way digital marketing is measured. Apple has been championing its Intelligent Tracking Prevention (ITP), which has steadily eroded the effectiveness of cookies and other tracking techniques for many digital marketing use cases, including measurement and attribution.

In response to this, large digital advertising platforms such as Meta and Google (AdTech) are increasingly advocating the use of first party data to fill in the gaps.

This article explores the privacy implications of this shift to using first party data.

What is first party data?

First party data in this context refers to details about your customers (or website / app users). These details typically include phone numbers, email addresses, names, but can also include addresses, date of birth and gender in some cases.

Most of these data points fall under the definition of personally identifiable information (PII), which is covered by privacy legislation in many jurisdictions. For example, the Office of the Australian Information Commissioner (OAIC) states ¹:

Personal information includes a broad range of information, or an opinion, that could identify an individual. What is personal information will vary, depending on whether a person can be identified or is reasonably identifiable in the circumstances.

For example, personal information may include:

an individual’s name, signature, address, phone number or date of birth

AdTech vendors’ position on hashing first party data

The AdTech vendors are keen to imply that the data shared with them is not PII because it is securely hashed using an industry standard SHA256 scheme:

From Meta ²: “Hashing is a type of cryptographic security method which turns the information in your customer list into randomised code. The process cannot be reversed.”
From Google ³: “The feature uses a secure one-way hashing algorithm called SHA256 on your first-party customer data…”
From Pinterest ⁴: “To maintain user privacy, you need to ensure all personalized information (i.e. email, phone number, etc.) is hashed using SHA256”
From TikTok ⁵: “As per standard industry practice, customer emails and phone numbers will be hashed with SHA256 before reaching TikTok servers for matching.”

What is hashing?

Hashing is the process of converting a string of text into a fixed length code, with one primary use case being efficient lookup of data.

From Wikipedia:

“Hash functions and their associated hash tables are used in data storage and retrieval applications to access data in a small and nearly constant time per retrieval.”

The SHA256 algorithm is a cryptographic hash which is, on the face of it, a secure and irreversible type of hash.
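
To make this concrete, here is a minimal sketch of hashing a customer identifier with SHA256 using Python's standard hashlib module. The helper names (sha256_hex, normalise_email) and the trim-and-lowercase normalisation are assumptions for illustration; each vendor documents its own normalisation rules.

```python
import hashlib


def sha256_hex(value: str) -> str:
    """Return the SHA-256 digest of a string as 64 lowercase hex characters."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def normalise_email(email: str) -> str:
    # Assumed normalisation for illustration: trim whitespace and lowercase.
    return email.strip().lower()


email_hash = sha256_hex(normalise_email("  Jane.Doe@Example.com "))
phone_hash = sha256_hex("61400123456")

print(len(email_hash))  # the digest length is fixed, whatever the input length
print(email_hash == sha256_hex("jane.doe@example.com"))  # same input, same hash
```

The second print is the property that matters for the rest of this article: the same input always produces the same hash, which is what lets a vendor match records and also what lets anyone else precompute hashes of likely inputs.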

From Wikipedia:

“finding an input string that matches a given hash value … is unfeasible”

At first glance this sounds fine: we convert our users’ PII into hashed codes that cannot be converted back into PII before sharing them with the AdTech vendors. However, the problem is the caveat at the end of the same sentence in the above article:

“finding an input string that matches a given hash value … is unfeasible, assuming all input strings are equally likely.”

This is important because where there is a predictable set of possible values for a data point, then it is possible to construct a “hash table” that allows the hashed value to be easily reversed.

What is a hash table?

A hash table in this sense is a precomputed table of hashes, more commonly called a lookup table (or, in a space-optimised form, a rainbow table), and is often used for cracking hashed passwords. If all, or even most, potential values are known for a hashed data point, such a table can be used to look up the original pre-hashed value.

Taking a couple of examples:

Date of birth is usually sent as a hash of a number like 19960126 (for 26 Jan 1996). Around 45,000 distinct values (roughly 123 years of dates) would cover the birthday of everybody who is alive today.
Australian mobile phone numbers (which are ten digits long, starting with 04) are normalised to include the country code, so 0400 123 456 -> 61400123456 (or +61400123456). With the first two digits fixed, eight digits are free, which gives 10^8 = 100 million possible values.
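
The date-of-birth case can be sketched in a few lines of Python. This is an illustration only: the date range (1900 to 2023) and the function name build_dob_table are assumptions, chosen to cover roughly the 45,000 values mentioned above.

```python
import hashlib
from datetime import date, timedelta


def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def build_dob_table(start: date, end: date) -> dict[str, str]:
    """Map SHA-256 hash -> YYYYMMDD string for every date from start to end."""
    table = {}
    day = start
    while day <= end:
        dob = day.isoformat().replace("-", "")  # e.g. "19960126"
        table[sha256_hex(dob)] = dob
        day += timedelta(days=1)
    return table


# Assumed range: every date of birth from 1900 to the end of 2023.
table = build_dob_table(date(1900, 1, 1), date(2023, 12, 31))
print(len(table))  # number of candidate dates in the table

hashed = sha256_hex("19960126")  # what an advertiser would send
print(table[hashed])             # the lookup recovers the original date
```

Once the table exists, reversing any hashed date of birth is a single dictionary lookup.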

Creating hash tables for these is quick and straightforward and does not require any specialised hardware. For example, generating all (100 million) Australian mobile numbers took less than five minutes on an old MacBook.
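
As a rough sketch of the phone number case, the Python below enumerates normalised Australian mobile numbers and builds a lookup table. It is not the code used for the timing above. To keep it quick to run, it hashes only an assumed slice of 100,000 numbers; widening the range to all 10^8 values is the same loop run for longer, with the table written to disk rather than held in memory.

```python
import hashlib


def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def mobile_numbers(start: int = 0, stop: int = 10**8):
    """Yield normalised Australian mobile numbers: '614' followed by 8 digits."""
    for n in range(start, stop):
        yield f"614{n:08d}"


# Ten digits with the leading '04' fixed leaves eight free digits.
keyspace = 10**8  # 100 million candidate numbers

# Hypothetical slice for illustration: 100,000 of the 100 million numbers.
table = {sha256_hex(num): num for num in mobile_numbers(100_000, 200_000)}

target = sha256_hex("61400123456")  # the hashed value an interceptor might see
print(table.get(target))            # the lookup returns the original number
```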

Extending this to the other data points that are typically included with first party data:

Address - an authoritative list of the (approximately) 15 million Australian postal addresses is available free ⁶.
First and last names - there is a publicly available dataset ⁷ of 730K first names and 983K last names, based on data from a Facebook (Meta) data leak of 533M user records.
Email address - this may be a little more challenging, but Have I Been Pwned ⁸ has data from more than 12 billion breached accounts (including duplicates), so lists of real email addresses may not be too difficult to find.

What does all of this mean?

Considering the points above, it would not be difficult for an AdTech vendor, or indeed any motivated third party, to reverse hashed customer data and convert it back to the source PII.

While you are sending your data to AdTech vendors so that they can match to their users (who have seen your advert), what you may actually be sending them is in many cases details of all the people that have purchased your products or services, regardless of whether they are a user of that vendor’s services or not.

To reiterate, by using Meta’s Conversions API (CAPI), you might be sending Meta the name, address, phone number, date of birth and email address, plus the transaction number and value, of a customer who has never used a Meta product!

If the data is intercepted by a third party, it would not be that difficult for that third party to reverse the hashed data and obtain the personal information for your customers.

If you are using a trusted third party such as a customer data platform (CDP) to send data on your behalf, this potentially increases the risk, because the data passes through one more party where it could be exposed.

Improving hash security: add salt and pepper!

There are a few existing ways to improve the security of hashing. We discuss the use of salt and pepper below.

It’s important to note though that none of the AdTech vendors currently support using either salt or pepper with their first party data uploads, so this is not an option open to advertisers today but definitely something we hope to see considered in the future.

What is salt?

One of the recommended ways to improve the security of cryptographic hashes is to use a salt. This works by adding additional random data to items being hashed and then saving this random data with the hashed data.

For example, if you had a phone number of 61400123456 and you generated a salt of Himalayan_, then you would actually apply the hashing algorithm to Himalayan_61400123456.

You would use a different “salt” value for each user. While this does not make it impossible to reverse the hashed value back into PII, it does make it much more time intensive. Instead of computing the hash of all phone numbers once (the 5 minute effort above) and then doing a near instantaneous lookup to reverse a hashed phone number, you would need to regenerate the list of hashed phone numbers for each record (so five minutes for each one).
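
A minimal sketch of per-record salting in Python follows. It is illustrative only: as noted above, none of the AdTech vendors accept salted hashes today, and the function name hash_with_salt and the 16-byte random salt are assumptions rather than any vendor's scheme.

```python
import hashlib
import secrets


def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def hash_with_salt(value: str) -> tuple[str, str]:
    """Return (salt, hash). The salt is stored alongside the hash."""
    salt = secrets.token_hex(16)  # fresh random salt for each record
    return salt, sha256_hex(salt + value)


salt_a, hash_a = hash_with_salt("61400123456")
salt_b, hash_b = hash_with_salt("61400123456")

# The same phone number hashes differently under different salts,
# so one precomputed table no longer covers every record.
print(hash_a == hash_b)

# Checking a guess means redoing the hash with that record's own salt.
print(sha256_hex(salt_a + "61400123456") == hash_a)
```

Because the salt travels with the record, an attacker can still test every candidate phone number against it, but has to repeat the full enumeration for each record rather than once for the whole dataset.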

What is pepper?

Another similar technique is referred to as a “pepper” (or secret salt). It works on the same basis, except that the pepper is a secret value known only to the person hashing the data and the person for whom the data is intended. In the context of the AdTech vendors, an agreed “pepper” value could be defined per account. This value, unlike the salt value above, would not be transmitted with the data itself, which would make it much more difficult for an untrusted third party to reverse the hashed data back to PII.
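
The effect of a pepper can be sketched as follows. The pepper value, the name ACCOUNT_PEPPER and the simple append-then-hash construction are assumptions for illustration; no vendor currently offers this.

```python
import hashlib


def sha256_hex(value: str) -> str:
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


# Hypothetical per-account secret, agreed out of band and never sent with the data.
ACCOUNT_PEPPER = "_Red-hot chilli"


def hash_with_pepper(value: str, pepper: str) -> str:
    return sha256_hex(value + pepper)


# A table an interceptor could precompute without knowing the pepper
# (a small slice of the phone number range, for illustration).
plain_table = {
    sha256_hex(f"614{n:08d}"): f"614{n:08d}" for n in range(123_000, 124_000)
}

peppered = hash_with_pepper("61400123456", ACCOUNT_PEPPER)

print(plain_table.get(sha256_hex("61400123456")))  # plain hash: number recovered
print(plain_table.get(peppered))                   # peppered hash: no match (None)

# The intended recipient knows the pepper and can recompute the same hash to match on.
print(hash_with_pepper("61400123456", ACCOUNT_PEPPER) == peppered)
```

The protection lasts only as long as the pepper stays secret, so it guards against interception by third parties rather than against the vendor that holds the pepper.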

You can use both salt and pepper together (extending the above example, you could get Himalayan_61400123456_Red-hot chilli). Adopting both salt and pepper as an industry best practice would greatly reduce the risk that the data could be decoded by any third parties, or, at the very least, make it more expensive to decode.

Louder’s recommendation

Think before you rush into sharing hashed PII with your favourite AdTech vendor, as there may be more risk than you think. Don’t take the comforting statements these companies make on “non-reversibility” at face value. It is easy to demonstrate that these claims do not hold for predictable data such as phone numbers and dates of birth.

You can also petition your AdTech vendors to step up their game and support more secure options, such as adding support for salts to their first party data APIs.

We would not be surprised if the use of first party data in this way is challenged in the near future under one of the more strict privacy regimes such as Europe’s GDPR.

If you’re interested in all things data and privacy, get in touch with the team at Louder or check out some of our other articles. Sign up to our newsletter to keep up with industry updates and future exciting articles to come!

References:

[1]: https://www.oaic.gov.au/privacy/your-privacy-rights/your-personal-information/what-is-personal-information

[2]: https://en-gb.facebook.com/business/help/112061095610075

[3]: https://support.google.com/google-ads/answer/9888656

[4]: https://developers.pinterest.com/docs/conversions/updated/

[5]: https://ads.tiktok.com/help/article/advanced-matching-web

[6]: https://data.gov.au

[7]: https://github.com/philipperemy/name-dataset

[8]: http://haveibeenpwned.com

About Ian Kenney

Ian is a Consultant and Partner at Louder and has been working with data and analytics since it was invented. He enjoys all things code.