The 4 Flavors of Untrustworthy Data
For too many product teams, a data-first approach to decision-making is more of a soundbite than a reality. There are plenty of reasons why data-first strategies never materialize: lack of resources, lack of tooling, lack of direction, and treating data as a project rather than a foundation are all blockers to getting off the ground. But even when an organization gets all of these things right, one culprit can bring the whole thing down: untrustworthy data.
Untrustworthy data is the root of all evil when it comes to a data-driven product strategy. One of my favorite explanations of this phenomenon comes from Brian Balfour. He calls it the Data Wheel of Death. The basic premise is this: when data isn’t trustworthy, teams use the data less. When teams use data less, it gets de-prioritized and grows stale. When data grows stale, it becomes less trustworthy, and the cycle continues. (You can watch a video of Brian talking about this phenomenon here.) But what does “untrustworthy data” actually mean?
This post will explore the four types of untrustworthy data, specifically as they relate to behavioral product data, meaning data that tells you about users and the things they do within a website or application. At Heap, we have a large number of clients who have partnered with us specifically as a means of solving this problem.
Across all of the teams we’ve spoken to that have dealt with data woes, “untrustworthy data” falls into 4 categories:
Stale Data
Unclear Data
Inaccurate Data
No Data
Stale Data
Stale data is data that is out of date and no longer being collected.
There’s a sinking feeling that comes with running a report and seeing a flatline: a single line that holds steady at 0 for the entire time period in the report.
What do you do next?
Surely this event did something at some point. Does it refer to an old version of the feature that has since been deprecated? Is the tracking code broken? Am I looking in the wrong place? It’s usually hard to tell why the data doesn’t exist, and the situation can be an annoying blocker to making a product decision with truth. All too often, we hear about PMs giving up on the data altogether and making a decision based on gut feeling.
This problem grows worse when an analytics environment is riddled with stale events; users are less likely to get their hands dirty exploring the data if they keep hitting a flat “0” line.
The root cause of the stale data problem is typically a lack of process around tracking product data. More specifically, it happens when no one maintains old events and no effort goes into updating the tracking plan as the product evolves.
In a paradigm built on tracking code and manual instrumentation, event-tracking whack-a-mole tends to take precedence as new features and use cases pop up, while the effort of cleaning up old data moves to the back burner.
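As a concrete illustration, here is a minimal sketch of what manual instrumentation can look like and how an event quietly goes stale. The analytics client, the track() call, and the event names are hypothetical stand-ins, not any specific vendor’s API:

```typescript
// Hypothetical analytics client, for illustration only.
const analytics = {
  track(eventName: string, properties?: Record<string, unknown>): void {
    console.log("track:", eventName, properties ?? {});
  },
};

// v1 of a feature: this handler fires the event the team reports on.
function onExportClickV1(): void {
  analytics.track("Export Report - v1", { source: "dashboard" });
}

// Later, the v1 button is removed in a redesign and a new handler ships.
// Unless someone also retires "Export Report - v1" from the tracking plan,
// that event lingers in the analytics tool and every report on it
// flatlines at 0.
function onExportClickV2(): void {
  analytics.track("Export Report - v2", { source: "dashboard" });
}

onExportClickV2(); // only the v2 event is still being collected
```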
Unclear Data
Raise your hand if you’ve dealt with an analytics environment that has multiple versions of the same data point. Something like “Signup”, “Sign Up”, “signup”, and “Signup – NEW” might all be separate events, even though they would seem to tell you the same thing.
Which one is the correct one? Are the others also “right,” but telling a different version of the story? Even if you know which version is correct, how can you be sure your teammates do? Data points that don’t clearly refer to one specific thing are common in product analytics, and the example above is just one version of this tricky problem.
Unclear data tends to lead to two bad outcomes.
In one scenario, a user ends up analyzing an event that does not tell them what they think it does. This situation is obviously pretty nasty, and in the worst case it leads to a decision being made with the wrong data.
On the other hand, even in the best case, where the user chooses the “right” event, only those close to the implementation typically have enough confidence in it to actually use it. For everyone else, trust in the entire dataset erodes, even if everything else is perfectly correct.
When events are manually instrumented, this is very hard to avoid. A rigid, code-generated dataset tends to be a hotbed for unclear event data, mostly because new events are implemented through code by only a small number of people.
Any inconsistency, duplication, or poor naming convention starts as a quick decision, a simple mistake, or someone saying “good enough”, but the problems that result tend to grow in scope and sneakily infect an entire analytics environment over time.
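To make this concrete, here is an illustrative sketch of how duplicate event names tend to accumulate when each call site hard-codes its own string. The analytics stub and event names are hypothetical, not a real vendor SDK:

```typescript
// Hypothetical analytics stub, for illustration only.
const analytics = {
  track(eventName: string): void {
    console.log("track:", eventName);
  },
};

// Sprint 1, web signup form:
analytics.track("Signup");
// Sprint 4, mobile flow, different engineer:
analytics.track("Sign Up");
// Sprint 9, redesigned flow shipped alongside the old one:
analytics.track("Signup - NEW");

// One mitigation: define event names once in a shared module so every
// call site references the same constant instead of a hand-typed string.
const EVENTS = { SIGNUP_COMPLETED: "Signup Completed" } as const;
analytics.track(EVENTS.SIGNUP_COMPLETED);
```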
Inaccurate Data
Inaccurate data is a problem that comes in 2 varieties:
Data that is inaccurate and you know it.
Data that is inaccurate, but you have no idea.
Inaccurate data of both kinds leads to the same two bad outcomes as the “unclear data” problem mentioned above: decreased trust in the data and faulty conclusions drawn from misleading data.
Let’s start with the type of inaccurate data that is obvious. This is the lesser of two evils, but an annoying obstacle nonetheless. If you’re someone who has used an analytics solution in the past, chances are at some point you’ve looked at a report and thought something like “There’s no chance that only 30 people viewed our homepage last month” or on the other end of the spectrum, “Oh nice, it looks like everyone on Earth clicked our new call-to-action…twice.”
The depth of these inaccuracies is hard to quantify (are the numbers just a little off, or am I looking at the wrong thing entirely?), and they are typically hard to fix, especially when events are created by a small team of engineers operating on their own. Obviously inaccurate data, like unclear data, doesn’t always lead to wrong conclusions, but it almost always leads to decreased adoption and usage of product data.
The second version of inaccurate data is the most formidable of all. When data is sneakily inaccurate, it can produce business decisions based on a complete fallacy.
Imagine you think a call-to-action at the bottom of your app’s homepage is outperforming a similar one at the top of the page. A report clearly shows the bottom CTA being clicked more often, so you decide to deprecate the top one.
You never find out, but the opposite was actually true: the top CTA was more effective than the bottom one. Maybe the event names got mixed up during implementation, or maybe the tracking code on the top CTA was flawed and not every click was logged.
The potential causes are many, but the scariest part is that you will probably never even know you were wrong.
No Data
Simply not having the data required to answer a business question is one of the most common challenges teams run into.
Too many times, an analytics implementation is treated as a one-time project with an end date. Early on, everyone expects that once the project is complete, they’ll always have the necessary information for every question that will ever pop up. After all, you’ve spent time scoping the requirements, building a tracking plan, and working with engineering to implement it all.
Then you hit a brick wall – now what?
This is when the one-time implementation comes back to bite. Teams often have no process in place for instrumenting new events when gaps in the dataset inevitably pop up. Instead, they end up in an endless cycle of event-tracking whack-a-mole, where new requests get put into an engineering queue and eventually land in a sprint. Once the new events are implemented, data still needs to build up, and by the time the question can be answered, it might not even be relevant anymore.
At this point, you might have a new set of questions that can’t be answered, and the cycle continues.
Addressing These Challenges
These challenges are tough and pervasive. So the question becomes: “How can any team manage to avoid the seemingly inevitable pitfalls of product analytics?” There are generally 3 approaches that teams take when they want to get ahead of these issues.
Putting resources and money towards preventing and fixing the problem
Spending a lot of time on the problem
Implementing a virtual dataset
Putting resources and money towards preventing and fixing the problem
For some large companies, the best approach is to staff the problem with lots of resources and money. These organizations dedicate hundreds of engineers and plenty of budget to maintaining a clean and complete dataset. When humans spend their days collecting, monitoring, cleaning, and analyzing data at this scale, you typically end up with a pretty useful set of information. The reality is that, for the vast majority of companies, this approach isn’t feasible.
Spending a lot of time on the problem
The second approach makes sense for a broader set of companies, specifically those who aren’t able to staff hundreds of engineers on analytics. It consists of spending lots of time planning, implementing, and building processes for updating and cleaning the data. In this scenario, the intention is usually to focus the entire company on customer data, but when fatigue sets in, it’s common to see other priorities take precedence.
Implementing a virtual dataset
Some teams choose to implement a virtual dataset, meaning they capture all event data up front, without any manual tracking code, and then later pick and choose which events they’d like to analyze. The benefit of a virtual dataset is that far fewer resources are needed to track all of the information, and the window from implementation to insight is much shorter. Plus, since the data is available retroactively, there’s no need to wait for it to build up before a question can be answered.
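As a rough illustration of the concept (not any vendor’s actual implementation), the sketch below captures every click with generic metadata once, then defines a named “virtual event” later as a filter over the raw stream. The event shape, selector, and “Start Trial” button are hypothetical:

```typescript
// Shape of an automatically captured interaction (illustrative).
interface RawEvent {
  type: string;       // e.g. "click"
  selector: string;   // e.g. "button#start-trial"
  path: string;       // page path at the time of the event
  timestamp: number;
}

const capturedEvents: RawEvent[] = [];

// Capture every click once, with no per-feature tracking code.
document.addEventListener("click", (e) => {
  const target = e.target as HTMLElement;
  capturedEvents.push({
    type: "click",
    selector: target.tagName.toLowerCase() + (target.id ? `#${target.id}` : ""),
    path: window.location.pathname,
    timestamp: Date.now(),
  });
});

// Weeks later, define a "virtual event" as a filter over the raw stream.
// Because the raw clicks were already captured, the definition applies
// retroactively to past data as well.
const startTrialClicks = capturedEvents.filter(
  (ev) => ev.type === "click" && ev.selector === "button#start-trial"
);
console.log(`Start Trial clicks: ${startTrialClicks.length}`);
```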
At Heap, we care about helping teams make business decisions with truth. This means that we spend a lot of time thinking about how to help teams avoid all flavors of untrustworthy data. If you’d like to hear more about how we can help you, reach out to us at sales@heap.io!