Anamap Blog

Frontend Data Collection vs Data Warehouse Ingestion

Business Insight

8/19/2024

Alex Schlee

Founder & CEO

What is Frontend Data Collection and what is Data Warehouse Ingestion?

Put simply, Frontend Data Collection refers to collecting and processing data on the service or client machine where the analytics events are generated. The most common example is the JavaScript that powers a site's web analytics, which runs on the device that is accessing the website.

Data Warehouse Ingestion refers to pulling data into your data warehouse from other locations, typically server-side sources such as an S3 bucket, a production database, or a CMS. There are many examples of this, but a fairly common one is to take a separate table of data about your marketing campaigns (from Facebook, Google, etc.) and pull it into your data warehouse to build cost per acquisition reporting.

In many cases, data collected during Frontend Data Collection ultimately ends up in your company Data Warehouse. This often leads to a discussion about whether certain data should be collected on the frontend or ingested directly into the Data Warehouse. To anchor the discussion, one example where this debate comes up is collecting product information on a product details page view. You can collect information about the product a user has viewed, like the product name, in the page view event itself, or you can store only the product ID. When the page view event data lands in the Data Warehouse, you can use that product ID to join in the product information ingested from your product database.
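To make the two options concrete, here is a minimal TypeScript sketch. The event shapes, field names, and the `enrichInWarehouse` helper are all hypothetical, and the in-memory `Map` is only a stand-in for a product table in your warehouse, where the join would more likely be written in SQL.

```typescript
// Option A: the frontend attaches product details to the event itself.
interface EnrichedPageView {
  event: "product_page_view";
  productId: string;
  productName: string; // captured as it was rendered to the user
  timestamp: string;
}

// Option B: the frontend sends only the key.
interface KeyOnlyPageView {
  event: "product_page_view";
  productId: string;
  timestamp: string;
}

// Hypothetical stand-in for a product table ingested into the warehouse.
type ProductTable = Map<string, { name: string }>;

// Warehouse-side enrichment: join product details onto the event by product ID.
function enrichInWarehouse(
  e: KeyOnlyPageView,
  table: ProductTable
): EnrichedPageView | null {
  const product = table.get(e.productId);
  if (product === undefined) return null; // unmatched key: nothing to join
  return { ...e, productName: product.name };
}

const products: ProductTable = new Map([
  ["sku-123", { name: "Trail Running Shoe" }],
]);

const keyOnly: KeyOnlyPageView = {
  event: "product_page_view",
  productId: "sku-123",
  timestamp: "2024-08-19T12:00:00Z",
};

const enriched = enrichInWarehouse(keyOnly, products);
```

Both routes end with the same fields on the event. The difference is where and when the product name gets attached, which is what the pros and cons in the rest of this post come down to.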

I'm going to cover the pros and cons of each option to help you decide which choice is right for your company and use case.

Frontend Data Collection

What are the benefits?

Point-in-time data: the data collected and attached to the event truly represents what the user saw. If your store team briefly goofed on the title of a product in the store, you'll be able to see that in the data.
Distributed processing: all of the data collection code runs on the client machine, which means the processing is distributed across many different computers. Each computer contributes only a few milliseconds of processing time, but in aggregate it saves your cloud resources a lot of compute time.
Downstream availability: because frontend data collection is as far upstream as possible, whatever data you collect should be available to any downstream platform you choose to send it to, whether that's a CDP, a CRM, an additional analytics platform, or even a data warehouse.
Simplicity: data that can be collected by the frontend is usually more accessible and requires less work to collect.
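
The downstream availability point can be pictured as a simple fan-out. This is a rough sketch only: the `track` function, the `Destination` type, and the two inbox arrays are hypothetical placeholders for whatever SDKs or tag manager you actually use.

```typescript
// A destination is anything that accepts an event payload
// (a CDP, a CRM, an analytics tool, a warehouse loader).
type Destination = (payload: Record<string, string>) => void;

// Hypothetical destinations that simply record what they receive.
const cdpInbox: Record<string, string>[] = [];
const analyticsInbox: Record<string, string>[] = [];

const destinations: Destination[] = [
  (p) => cdpInbox.push(p),
  (p) => analyticsInbox.push(p),
];

// Send the same payload to every destination.
function track(payload: Record<string, string>): void {
  for (const send of destinations) send(payload);
}

track({
  event: "product_page_view",
  productId: "sku-123",
  productName: "Trail Running Shoe",
});
```

Because the product name is attached before the fan-out, every destination receives it without needing its own enrichment step.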

What are the drawbacks?

Requires a code release (maybe): unless you're doing all your frontend data collection with a tag manager, updating your data collection will likely require a release of some kind, with the associated QA and release processes.
Harder to update data after collection: most platforms allow some degree of data management to help fix minor errors in the data. However, fixing the data is much harder after it is already collected and in downstream systems.
Typically no access to private data: frontend data collection happens in the wild on client machines, which typically means whatever data is available to pass into your data platform is also available for anyone to snoop. For security reasons, frontends only expose data that is safe to have publicly accessible, so any data collection on the frontend is limited to that information.

Data Warehouse Ingestion

What are the benefits?

Post-hoc updates: because most of your data warehouse ingestion relies on keys to join the data, you can change the data that you're joining onto those keys fairly easily.
Access to non-public data: one major benefit of data warehouse ingestion is that you're able to enrich your events with any non-public data you have. While the frontend is limited to public data for security reasons, your data warehouse can pull in sensitive data because that data is not exposed to the public.
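
Here is a small sketch of the post-hoc update benefit. The event list, the `productCategory` dimension, and the `report` function are hypothetical, and the `Map` lookup stands in for a join you would normally run in the warehouse.

```typescript
// Events already collected: they carry only the join key.
const events = [
  { productId: "sku-123", timestamp: "2024-08-01T09:00:00Z" },
  { productId: "sku-123", timestamp: "2024-08-02T09:00:00Z" },
];

// Hypothetical product dimension keyed by product ID (note the typo).
const productCategory = new Map<string, string>([["sku-123", "Shoos"]]);

// The join happens at report time, so it always reflects the current dimension.
function report(): string[] {
  return events.map((e) => productCategory.get(e.productId) ?? "unknown");
}

const before = report(); // ["Shoos", "Shoos"]

// Fix a single dimension row; the collected events are never touched.
productCategory.set("sku-123", "Shoes");

const after = report(); // ["Shoes", "Shoes"]
```

One corrected row fixes every historical event that joins to it. Had the category been stamped onto each event at collection time, every affected event would need to be rewritten instead.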

What are the drawbacks?

Data may not match what a user saw: ingestion processes tend to run in batches for efficiency, but that also means the data in the ingested source has time to change and deviate from what a user actually experienced. As a result, the data may not be truly representative of a user's experience.
Centralized processing: the ingestion is all happening on your servers which means you will incur all of the compute cost and data transfer cost; doing too much ingestion can drive up your cloud computing bills.
Data only available in Data Warehouse: if you're using a visualization tool such as Amplitude, Adobe Analytics, or Google Analytics, that tool won't have access to the full suite of information that is updated on the Data Warehouse side, which means the reporting has to be done separately. Additionally, if you have a CDP or similar, this enriched data cannot be used for creating audiences and is harder to use for triggering other CRM events.
Complexity: setting up the timed tasks and writing the appropriate ETL processes can be more complicated than the equivalent work on the frontend.
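
The first drawback above is the mirror image of the point-in-time benefit, and a short sketch shows how the two can disagree. All names here are hypothetical; `nightlySnapshot` stands in for a product table loaded by a batch job after the title was corrected.

```typescript
// A page view captured on the frontend, with the title as the user saw it.
const viewed = {
  productId: "sku-123",
  productNameAtView: "Trail Runing Shoe", // the briefly misspelled title
  timestamp: "2024-08-19T12:00:00Z",
};

// Hypothetical nightly snapshot of the product table, taken after the fix.
const nightlySnapshot = new Map<string, string>([
  ["sku-123", "Trail Running Shoe"],
]);

// Compare what the user saw with what a warehouse join would report.
function titleDrifted(
  e: typeof viewed,
  snapshot: Map<string, string>
): boolean {
  const current = snapshot.get(e.productId);
  return current !== undefined && current !== e.productNameAtView;
}

const drifted = titleDrifted(viewed, nightlySnapshot); // true
```

If only the product ID had been collected, the warehouse join would report the corrected title and the misspelling the user actually saw would be invisible in the data.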

Summary

I hope this guide will give you a deeper understanding of your options for how to enrich your data. The next time this question comes up you can be more thoughtful in your planning. Whether you collect the data via Frontend Data Collection or Data Warehouse Ingestion, Anamap can help you map your data so analysts and stakeholders can more easily use the data to generate better insights.

ABOUT THE AUTHOR

Alex Schlee is the founder of Anamap and has experience spanning the full gamut of analytics from implementation engineering to warehousing and insight generation. He's a great person to connect with about anything related to analytics or technology.