20 September 2023
Dude where’s my data? GA4 data retention
- GA4 uses identifiers such as client ID to create aggregated data to populate standard reports
- Data retention impacts your reporting flexibility for date ranges beyond your data retention window
- Exploration reports and some standard reports are limited to data within the set data retention period as these do not always use aggregated data
Importance of data retention
In our recent article, “GA4 setup challenges: 10 key mistakes to avoid”, the number one item on our list was data retention in Google Analytics 4 (GA4).
We gave it so much emphasis because Universal Analytics has now been taken off life support, with data processing stopping for users on the free offering.
You’ll want to ensure that GA4 is collecting as much useful data as possible, from as early as possible. Unfortunately, as of right now, Google’s own documentation, product updates and feature usability haven’t really emphasised the importance of getting this configuration right (in our own humble opinion, of course).
We’re hoping that by sharing these details with you now, we can save you much unnecessary hair pulling later down the line, avoiding the dread of discovering the data you thought you had been collecting isn’t there, leading to a genuine “Dude where’s my data?” moment.
What does the GA4 data retention setting actually do?
When it comes to data retention in GA4 there are actually multiple forms of data retention and options to be aware of.
Most of the default reports, i.e. the reports that live under the generic “Reports” menu in GA4, are based on aggregated (or pre-processed) data.
- Reports: mostly populated by aggregated data
- Explore: raw event data
So what does aggregate data mean?
When you visit a website with Google Analytics 4 implemented, you’ll be assigned a client ID. This ID helps GA4 answer questions such as but not limited to:
- How many pages you viewed (or page_view events you triggered)
- If you’ve been to the site previously (new vs returning user)
- If you reached a conversion point (triggered an event marked as conversion)
- How you arrived (campaign source)
You can think of the client ID as the glue that sticks all your on-site or in-app activity in a given session or over multiple sessions together.
When GA4 processes data to be stored for the reports section, it looks across multiple user sessions and begins a process of tallying up metrics against specific events and associated dimensions for those users into daily summary tables.
For example, the default Engagement → Conversions report in GA4 has the following columns:
| Event name | Conversions | Total users | Total revenue |
When the data is first being processed it’s necessary to use information such as the client ID to distinctly count total users. However, once this aggregated data is built, it’s actually no longer necessary to ‘preserve/persist’ the client ID information in order to display this report.
How do aggregates work?
You can think about aggregated data using a real world example. Say you wish to know the total number of cars registered and on the road in any given state of Australia. To do that, you would need to count up each distinct licence plate number for the entire state (where each unique licence plate ID is like the client ID in GA4). Once you’ve tallied up your total number of registered cars, you no longer need to store the underlying licence plate IDs.
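To make the idea concrete, here is a minimal sketch of how raw events could be rolled up into a daily summary table. It is an illustration only, not how GA4 is actually implemented: the event tuples, the `build_daily_summary` function and the field names `event_count` and `total_users` are all made up for this example.

```python
from collections import defaultdict

# Hypothetical raw events: (event_date, client_id, event_name).
raw_events = [
    ("2023-08-01", "client-a", "page_view"),
    ("2023-08-01", "client-a", "purchase"),
    ("2023-08-01", "client-b", "page_view"),
    ("2023-08-02", "client-a", "page_view"),
]


def build_daily_summary(events):
    """Tally event counts and distinct users per (date, event name)."""
    counts = defaultdict(int)
    users = defaultdict(set)
    for event_date, client_id, event_name in events:
        key = (event_date, event_name)
        counts[key] += 1
        users[key].add(client_id)
    # Only the totals are returned; the client IDs themselves are not kept.
    return {
        key: {"event_count": counts[key], "total_users": len(users[key])}
        for key in counts
    }


daily_summary = build_daily_summary(raw_events)
for key in sorted(daily_summary):
    print(key, daily_summary[key])
```

Once `daily_summary` exists, the raw events could be thrown away and the table would still answer "how many users triggered page_view on 1 August?". What it can no longer answer is a new question, such as which users triggered both page_view and purchase, because the identifiers needed to work that out are gone.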
In GA4, data retention works in a similar way. As per the documentation: “The retention period applies to user-level and event-level data associated with cookies, user-identifiers (e.g., User-ID), and advertising identifiers (e.g., DoubleClick cookies, Android’s Advertising ID [AAID or AdID], Apple’s Identifier for Advertisers [IDFA]).”
Now let’s look at the GA4 default 2 month data retention period. If your GA4 property is configured to 2 months data retention, that doesn’t mean that data older than 2 months is automatically removed from your GA4 reports. It’s actually a bit more nuanced than that.
So what exactly happens to my data when retention expires?
Probably the best way to see this is to look at GA4 exploration reports for a date range that goes further back than your current retention period setting.
For example, if your account has the default retention period of 2 months, go back to a period that is more than 2 months ago.
The screenshot below shows the Louder account exploration report (and date range picker):
In our case we’ve gone back to June 2022, which is as far back as our data retention setting at the time (14 months) allows.
In this exploration report, we cannot select a date range earlier than the 24th of June 2022 (the date of the screenshot was 24 August 2023 - 14 months later).
Note: standard GA4 properties can select a maximum retention period of 14 months. However, GA4 360 can go up to 50 months, which we’ve since enabled on our Louder GA4 360 status account.
So our explore reports are limited to only 14 months, despite the fact that the Louder GA4 account has been collecting data as far back as 2021.
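The date arithmetic in this example can be sketched in a few lines of Python. This only reproduces the calculation described above (screenshot date minus the retention period); exactly how GA4 computes its own cutoff is an assumption on our part, and `subtract_months` is a helper written for this illustration.

```python
import calendar
from datetime import date


def subtract_months(day: date, months: int) -> date:
    """Return the date `months` calendar months before `day`.

    The day of month is clamped, so 31 March minus one month gives the
    last day of February rather than an invalid date.
    """
    month_index = day.year * 12 + (day.month - 1) - months
    year, month_zero_based = divmod(month_index, 12)
    month = month_zero_based + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


screenshot_date = date(2023, 8, 24)
earliest_selectable = subtract_months(screenshot_date, 14)
print(earliest_selectable.isoformat())
```

Swapping the 14 for 2 shows how little history the default setting leaves available to explorations on the same day.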
But you said data isn’t just dropped, what gives?
That’s true, but this is where the nuances come in. Exploration reports are highly customisable and therefore rely on recalculating data from identifiers like client ID and event information (non-aggregated data).
However, the reports section of GA4, which is based on aggregated data (pre-calculated), is still available and visible with some caveats.
Included below is an example that illustrates what happens when data is dropped by GA4 after the retention period in the core reports section of GA4.
This date range shows 27th Jan 2021 to 23rd August 2023. We can see the graph shows activity through the whole period (past the existing 14 month limit imposed by data retention), and this is because the graph is calculated or displayed based on aggregated data.
However, note the exclamation notifications on the ‘New users’ and ‘Sessions’ by ‘default channel group’ cards. These indicate that only partial data can be shown, because these report cards need to use identifiers that are discarded under the data retention settings. For the given reporting date range, they can therefore only return partial data.
If we change the date range to start after the 24th of June 2022, the notifications will turn to a green checkmark indicating full data is available, since this period is within the 14 month retention period.
So, in summary: if you have a limited retention window and want to use explorations for date ranges beyond the retention period, you’ll be out of luck. If you look at the default aggregate reports (under the Reports section) and include a date range that extends beyond the retention period, you may only get partial data for certain report cards (depending on the metrics and dimensions in that card).
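If it helps, that summary can be written down as a tiny decision function. This is just our reading of the behaviour described in this article, not a GA4 API: the function name, the surface labels and the returned strings are all invented for illustration.

```python
from datetime import date


def expected_coverage(surface: str, range_start: date, retention_cutoff: date) -> str:
    """Describe what to expect for a date range starting at `range_start`.

    `surface` is "explore" or "reports". `retention_cutoff` is the earliest
    date still inside the retention window (an input here, not looked up).
    """
    if surface not in ("explore", "reports"):
        raise ValueError(f"unknown surface: {surface!r}")
    if range_start >= retention_cutoff:
        return "full data"
    if surface == "explore":
        return "not selectable before the cutoff"
    # Standard reports still show aggregates, but cards that rely on
    # discarded identifiers can only return partial data.
    return "aggregates shown, some cards partial"


cutoff = date(2022, 6, 24)
print(expected_coverage("reports", date(2021, 1, 27), cutoff))
print(expected_coverage("explore", date(2021, 1, 27), cutoff))
```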
So what can you do if you want to analyse historical data and drill deeper into it to better understand your audience’s behaviour over time?
Well that’s where integrations like the GA4 BigQuery connector come into play.
GA4 BigQuery connector
BigQuery allows you to export GA4 event level data to a Google Cloud BigQuery dataset and store the data for as long as you like. Sign up to our newsletter to receive the next article on how BigQuery can help you with your data retention.
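As a small taste, here is a sketch of the kind of query you could run over exported data for a date range well beyond the retention window. It only builds the SQL text; it does not connect to BigQuery. The project and dataset names are placeholders you would replace with your own, and the function name is ours. The `events_*` daily tables and the `event_date`, `event_name` and `user_pseudo_id` columns come from the GA4 export schema.

```python
def build_daily_users_query(project_id: str, dataset_id: str, start: str, end: str) -> str:
    """Build a BigQuery Standard SQL query over the GA4 export's daily tables.

    `start` and `end` are YYYYMMDD strings matching the suffix of the
    events_YYYYMMDD tables. `project_id` and `dataset_id` are placeholders
    and should come from trusted configuration, not user input.
    """
    for value in (start, end):
        if len(value) != 8 or not value.isdigit():
            raise ValueError(f"expected a YYYYMMDD date, got {value!r}")
    return f"""
SELECT
  event_date,
  event_name,
  COUNT(*) AS event_count,
  COUNT(DISTINCT user_pseudo_id) AS total_users
FROM `{project_id}.{dataset_id}.events_*`
WHERE _TABLE_SUFFIX BETWEEN '{start}' AND '{end}'
GROUP BY event_date, event_name
ORDER BY event_date, event_name
""".strip()


# Placeholder identifiers, for illustration only.
query = build_daily_users_query("my-project", "analytics_123456", "20210127", "20230823")
print(query)
```

Because the export keeps `user_pseudo_id` on every row, distinct user counts like this can be recalculated for any period you have exported, which is exactly what the aggregated reports cannot do once the identifiers have expired.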
Get in touch
Reach out to Louder to learn more and explore how you can use GA4 for all your data analysis needs!