Good Data Analysis  |  Machine Learning  |  Google for Developers (2024)

Author: Patrick Riley

Special thanks to: Diane Tang, Rehan Khan, Elizabeth Tucker, Amir Najmi, Hilary Hutchinson, Joel Darnauer, Dale Neal, Aner Ben-Artzi, Sanders Kleinfeld, David Westbrook, and Barry Rosenberg.

Overview

Deriving truth and insight from a pile of data is a powerful but error-prone job. The best data analysts and data-minded engineers develop a reputation for making credible pronouncements from data. But what are they doing that gives them credibility? I often hear adjectives like careful and methodical, but what do the most careful and methodical analysts actually do?

This is not a trivial question, especially given the type of data that we regularly gather at Google. Not only do we typically work with very large data sets, but those data sets are extremely rich. That is, each row of data typically has many, many attributes. When you combine this with the temporal sequences of events for a given user, there are an enormous number of ways of looking at the data. Contrast this with a typical academic psychology experiment where it's trivial for the researcher to look at every single data point. The problems posed by our large, high-dimensional data sets are very different from those encountered throughout most of the history of scientific work.

This document summarizes the ideas and techniques that careful, methodical analysts use on large, high-dimensional data sets. Although this document focuses on data from logs and experimental analysis, many of these techniques are more widely applicable.

The remainder of the document comprises three sections covering different aspects of data analysis:

  • Technical: Ideas and techniques on manipulating and examining your data.
  • Process: Recommendations on how you approach your data, what questions to ask, and what things to check.
  • Mindset: How to work with others and communicate insights.

Technical

Let's look at some techniques for examining your data.

Look at your distributions

Most practitioners use summary metrics (for example, mean, median, standard deviation, and so on) to communicate about distributions. However, you should usually examine much richer distribution representations by generating histograms, cumulative distribution functions (CDFs), quantile-quantile (Q-Q) plots, and so on. These richer representations allow you to detect important features of the data, such as multimodal behavior or a significant class of outliers.
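
To make this concrete, here is a minimal Python sketch that draws all three views for a synthetic latency column. The data, the `latency_ms` name, and the bimodal shape are assumptions made up for illustration, not anything from the original text.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical data: per-request page load times in milliseconds,
# deliberately built with a second, slower subpopulation.
rng = np.random.default_rng(0)
latency_ms = np.concatenate([
    rng.lognormal(mean=5.0, sigma=0.3, size=9_000),   # typical requests
    rng.lognormal(mean=7.0, sigma=0.5, size=1_000),   # a slow subpopulation
])

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Histogram: reveals the second mode that the mean alone would hide.
axes[0].hist(latency_ms, bins=100)
axes[0].set_title("Histogram")

# Empirical CDF: useful for reading off arbitrary quantiles.
x = np.sort(latency_ms)
axes[1].plot(x, np.arange(1, len(x) + 1) / len(x))
axes[1].set_title("Empirical CDF")

# Q-Q plot against a normal distribution: heavy tails show up as curvature.
stats.probplot(latency_ms, dist="norm", plot=axes[2])
axes[2].set_title("Q-Q plot vs. normal")

print(f"mean={latency_ms.mean():.0f} ms, median={np.median(latency_ms):.0f} ms")
plt.tight_layout()
plt.show()
```

Note how far the mean and median sit apart for this distribution; a single summary number would not tell you why.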

Consider the outliers

Examine outliers carefully because they can be canaries in the coal mine that indicate more fundamental problems with your analysis. It's fine to exclude outliers from your data or to lump them together into an "unusual" category, but you should make sure that you know why data ended up in that category.

For example, looking at the queries with the lowest number of clicks may reveal clicks on elements that you are failing to count. Looking at queries with the highest number of clicks may reveal clicks you should not be counting. On the other hand, there may be some outliers you will never be able to explain, so you need to be careful in how much time you devote to this task.
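
A small sketch of what this can look like in pandas, assuming a hypothetical query table with made-up `query` and `clicks` columns: inspect both tails, and if you do bucket outliers, keep them in a labeled category rather than silently dropping them.

```python
import pandas as pd

# Hypothetical table of queries with click counts; names are assumptions.
queries = pd.DataFrame({
    "query": ["weather", "maps", "???", "news", ""],
    "clicks": [120, 85, 0, 64, 40_000],
})

# Look at both tails rather than silently dropping them.
lowest = queries.nsmallest(10, "clicks")    # may reveal clicks you fail to count
highest = queries.nlargest(10, "clicks")    # may reveal clicks you should not count
print("Lowest-click queries:\n", lowest)
print("Highest-click queries:\n", highest)

# If you bucket outliers, keep them in a labeled "unusual" category.
queries["bucket"] = pd.cut(
    queries["clicks"],
    bins=[-1, 0, 1_000, float("inf")],
    labels=["zero", "typical", "unusual"],
)
print(queries["bucket"].value_counts())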

Consider noise

Randomness exists and will fool us. Some people think, “Google has so much data; the noise goes away.” This simply isn’t true. Every number or summary of data that you produce should have an accompanying notion of your confidence in this estimate (through measures such as confidence intervals and p-values).
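
One generic way to attach a confidence interval to almost any summary is the nonparametric bootstrap. Here is a minimal sketch with synthetic, made-up data; the metric and sample sizes are assumptions for illustration only.

```python
import numpy as np

# Hypothetical metric values for one experiment arm (e.g., queries per user).
rng = np.random.default_rng(1)
values = rng.poisson(lam=3.2, size=5_000)

point_estimate = values.mean()

# Nonparametric bootstrap: resample with replacement and recompute the mean.
boot_means = np.array([
    rng.choice(values, size=len(values), replace=True).mean()
    for _ in range(2_000)
])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"mean = {point_estimate:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

Reporting the interval alongside the point estimate makes the uncertainty explicit instead of leaving it to the reader's imagination.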

Look at examples

Anytime you are producing new analysis code, you need to look at examples from the underlying data and how your code is interpreting those examples. It’s nearly impossible to produce working code of any complexity without performing this step. Your analysis is abstracting away many details from the underlying data to produce useful summaries. By looking at the full complexity of individual examples, you can gain confidence that your summarization is reasonable.

How you sample these examples is important (see the sketch after this list):

  • If you are classifying the underlying data, look at examples belonging to each class.
  • If it's a bigger class, look at more samples.
  • If you are computing a number (for example, page load time), make sure that you look at extreme examples (fastest and slowest 5% perhaps; you do know what your distribution looks like, right?) as well as points throughout the space of measurements.
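
A rough pandas sketch of that sampling strategy. The file name, the `label` and `load_time_ms` columns, and the sample-size heuristic are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical labeled events; file and column names are assumptions.
events = pd.read_csv("events.csv")

# Per-class examples, drawing more samples from larger classes.
for label, group in events.groupby("label"):
    n = min(len(group), max(5, len(group) // 1_000))
    print(f"--- {label}: {len(group)} rows, inspecting {n} ---")
    print(group.sample(n=n, random_state=0))

# For a numeric metric, inspect both tails plus points across the range.
fast = events.nsmallest(20, "load_time_ms")
slow = events.nlargest(20, "load_time_ms")
spread = events.sort_values("load_time_ms").iloc[:: max(1, len(events) // 20)]
print(fast, slow, spread, sep="\n")
```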

Slice your data

Slicing means separating your data into subgroups and looking at metric values for each subgroup separately. We commonly slice along dimensions like browser, locale, domain, device type, and so on. If the underlying phenomenon is likely to work differently across subgroups, you must slice the data to confirm whether that is indeed the case. Even if you do not expect slicing to produce different results, looking at a few slices for internal consistency gives you greater confidence that you are measuring the right thing. In some cases, a particular slice may have bad data, a broken user interaction, or in some way be fundamentally different.

Anytime you slice data to compare two groups (such as experiment vs. control, or even “time A” vs. “time B”), you need to be aware of mix shifts. A mix shift is when the amount of data in the slices for each group is different. Simpson's paradox and other confusions can result. Generally, if the relative amount of data in a slice is the same across your two groups, you can safely make a comparison.
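
Here is a minimal sketch of slicing a two-arm comparison and checking for a mix shift. The file and the `arm`, `browser`, and `clicks` columns are hypothetical placeholders, not names from the original text.

```python
import pandas as pd

# Hypothetical per-row data; assumed columns: arm, browser, clicks.
df = pd.read_csv("experiment_rows.csv")

# Metric per slice and arm.
by_slice = df.groupby(["browser", "arm"])["clicks"].mean().unstack("arm")
print(by_slice)

# Mix-shift check: the share of data in each slice should be similar across arms.
counts = df.groupby(["arm", "browser"]).size()
mix = counts / counts.groupby(level="arm").transform("sum")
print(mix.unstack("browser"))
# Large differences in these shares mean slice-level comparisons (and the
# overall comparison) can be distorted by Simpson's paradox.
```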

Consider practical significance

With a large volume of data, it can be tempting to focus solely on statistical significance or to home in on the details of every bit of data. But you need to ask yourself, "Even if it is true that value X is 0.1% more than value Y, does it matter?" This can be especially important if you are unable to understand/categorize part of your data. If you are unable to make sense of some user-agent strings in your logs, whether it represents 0.1% or 10% of the data makes a big difference in how much you should investigate those cases.

Alternatively, you sometimes have a small volume of data. Many changes will not look statistically significant, but that is different from claiming these changes are “neutral.” You must ask yourself, “How likely is it that there is still a practically significant change?”
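
One way to keep statistical and practical significance separate is to compare the confidence interval for the difference against a threshold below which you would not care. A rough sketch with synthetic data; the threshold, metric, and sample sizes are all assumptions for illustration.

```python
import numpy as np

# Hypothetical per-user metric for control and treatment arms.
rng = np.random.default_rng(2)
control = rng.normal(loc=10.0, scale=4.0, size=400)
treatment = rng.normal(loc=10.2, scale=4.0, size=400)

diff = treatment.mean() - control.mean()
se = np.sqrt(control.var(ddof=1) / len(control) +
             treatment.var(ddof=1) / len(treatment))
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

# Assumed: a change smaller than this (in the metric's units) does not matter.
PRACTICAL_THRESHOLD = 0.5

print(f"diff = {diff:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
if -PRACTICAL_THRESHOLD < ci_low and ci_high < PRACTICAL_THRESHOLD:
    print("Any real change is likely too small to matter in practice.")
elif ci_low > PRACTICAL_THRESHOLD or ci_high < -PRACTICAL_THRESHOLD:
    print("The change is both statistically and practically significant.")
else:
    print("Inconclusive: a practically significant change is still plausible.")
```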

Check for consistency over time

You should almost always try slicing data by units of time because many disturbances to underlying data happen as our systems evolve over time. (We often use days, but other units of time may also be useful.) During the initial launch of a feature or new data collection, practitioners often carefully check that everything is working as expected. However, many breakages or unexpected behavior can arise over time.

Just because a particular day or set of days is an outlier does not mean you should discard the corresponding data. Use the data as a hook to determine a causal reason why that day or days is different before you discard it.

Looking at day-over-day data also gives you a sense of the variation in the data that would eventually lead to confidence intervals or claims of statistical significance. This should not generally replace rigorous confidence-interval calculation, but often with large changes you can see they will be statistically significant just from the day-over-day graphs.
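
A small sketch of a day-over-day view, assuming a hypothetical event log with `timestamp`, `query`, and `clicked` columns; the file name, columns, and the 30% deviation threshold are illustrative assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical event log with a timestamp column; names are assumptions.
events = pd.read_csv("events.csv", parse_dates=["timestamp"])

daily = events.set_index("timestamp").resample("D").agg(
    {"query": "size", "clicked": "mean"}
).rename(columns={"query": "rows", "clicked": "clicks_per_query"})

# Day-over-day plots make breakages, outlier days, and natural variation visible.
daily.plot(subplots=True, figsize=(10, 5), marker="o")
plt.show()

# Flag days that deviate sharply from the recent trailing median for follow-up
# before deciding whether to exclude them.
rolling = daily["rows"].rolling(7, min_periods=3).median()
suspicious = daily[(daily["rows"] - rolling).abs() > 0.3 * rolling]
print(suspicious)
```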

Acknowledge and count your filtering

Almost every large data analysis starts by filtering data in various stages. Maybe you want to consider only US users, or web searches, or searches with ads. Whatever the case, you must:

  • Acknowledge and clearly specify what filtering you are doing.
  • Count the amount of data being filtered at each step.

Often the best way to do the latter is to compute all your metrics, even for the population you are excluding. You can then look at that data to answer questions like, "What fraction of queries did spam filtering remove?" (Depending on why you are filtering, that type of analysis may not always be possible.)
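
A sketch of counting what each filtering step removes, rather than only reporting the final population. The file and the `country`, `vertical`, `is_spam`, and `clicks` columns are hypothetical, chosen only to mirror the examples above.

```python
import pandas as pd

# Hypothetical raw log; column names are assumptions for illustration.
raw = pd.read_csv("queries.csv")

us = raw[raw["country"] == "US"]
web = us[us["vertical"] == "web"]
clean = web[~web["is_spam"]]
steps = [("all rows", raw), ("US only", us), ("web searches", web), ("non-spam", clean)]

# Report how much data each filtering step keeps, not just the final population.
total = len(raw)
for name, subset in steps:
    print(f"{name:>12}: {len(subset):>9} rows ({len(subset) / total:.1%} of raw)")

# Where possible, also compute your metrics on the excluded population.
excluded = raw.loc[~raw.index.isin(clean.index)]
print("Clicks/query among excluded rows:", excluded["clicks"].mean())
```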

Ratios should have clear numerators and denominators

Most interesting metrics are ratios of underlying measures. Oftentimes, interesting filtering or other data choices are hidden in the precise definitions of the numerator and denominator. For example, which of the following does “Queries / User” actually mean?

  • Queries / Users with a Query
  • Queries / Users who visited Google today
  • Queries / Users with an active account (yes, I would have to define active)

Being really clear here can avoid confusion for yourself and others.

Another special case is metrics that can be computed only on some of your data. For example, "Time to Click" typically means "Time to Click given that there was a click." Any time you are looking at a metric like this, you need to acknowledge that filtering and look for a shift in filtering between groups you are comparing.
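
The ambiguity is easy to see once the denominators are written out explicitly. In this sketch the files (`query_log.csv`, `visits.csv`) and the `user_id`, `clicked`, and `active_account` columns are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical per-event log; each row is one query by one user.
log = pd.read_csv("query_log.csv")

total_queries = len(log)

# Three different denominators give three different "Queries / User" numbers.
users_with_query = log["user_id"].nunique()
users_visited = pd.read_csv("visits.csv")["user_id"].nunique()   # assumed source
active_users = log.loc[log["active_account"], "user_id"].nunique()

print("Queries / user (with a query):   ", total_queries / users_with_query)
print("Queries / user (visited today):  ", total_queries / users_visited)
print("Queries / user (active account): ", total_queries / active_users)

# Conditional metric: "time to click" only exists where there was a click.
clicked = log[log["clicked"]]
print("Fraction of queries with a click:", len(clicked) / total_queries)
# If that fraction shifts between the groups you compare, the conditional
# metric is not comparable on its own.
```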

Process

This section contains recommendations on how to approach your data, what questions to ask about your data, and what to check.

Separate Validation, Description, and Evaluation

I think of data analysis as having three interrelated stages:

  1. Validation [1]: Do I believe the data is self-consistent, that it was collected correctly, and that it represents what I think it does?
  2. Description: What's the objective interpretation of this data? For example, "Users make fewer queries classified as X," "In the experiment group, the time between X and Y is 1% larger," and "Fewer users go to the next page of results."
  3. Evaluation: Given the description, does the data tell us that something good is happening for the user, for Google, or for the world?

By separating these stages, you can more easily reach agreement with others. Description should be things that everyone can agree on for the data. Evaluation is likely to spur much more debate. If you do not separate Description and Evaluation, you are much more likely to only see the interpretation of the data that you are hoping to see. Further, Evaluation tends to be much harder because establishing the normative value of a metric, typically through rigorous comparisons with other features and metrics, takes significant investment.

These stages do not progress linearly. As you explore the data, you may jump back and forth between the stages, but at any time you should be clear what stage you are in.

Confirm experiment and data collection setup

Before looking at any data, make sure you understand the context in which the data was collected. If the data comes from an experiment, look at the configuration of the experiment. If it's from new client instrumentation, make sure you have at least a rough understanding of how the data is collected. You may spot unusual/bad configurations or population restrictions (such as valid data only for Chrome). Anything notable here may help you build and verify theories later. Some things to consider:

  • If the experiment is running, try it out yourself. If you can't, at least look through screenshots/descriptions of behavior.
  • Check whether there was anything unusual about the time range the experiment ran over (holidays, big launches, etc.).
  • Determine which user populations were subjected to the experiment.

Check for what shouldn't change

As part of the "Validation" stage, before actually answering the question you are interested in (for example, "Did adding a picture of a face increase or decrease clicks?"), rule out any other variability in the data that might affect the experiment. For example:

  • Did the number of users change?
  • Did the right number of affected queries show up in all my subgroups?
  • Did error rates change?

These questions are sensible both for experiment/control comparisons and when examining trends over time.
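
Such invariant checks are easy to automate. A minimal sketch, assuming a hypothetical table with `arm`, `user_id`, and `is_error` columns, arm labels of "control" and "treatment", and made-up tolerance thresholds:

```python
import pandas as pd

# Hypothetical per-row experiment data; names and thresholds are assumptions.
df = pd.read_csv("experiment_rows.csv")

summary = df.groupby("arm").agg(
    users=("user_id", "nunique"),
    rows=("user_id", "size"),
    error_rate=("is_error", "mean"),
)
print(summary)

# Simple invariant checks: quantities the experiment should not move.
control, treatment = summary.loc["control"], summary.loc["treatment"]
assert abs(treatment["users"] / control["users"] - 1) < 0.02, "user counts diverged"
assert abs(treatment["error_rate"] - control["error_rate"]) < 0.001, "error rate moved"
```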

Standard first, custom second

When looking at new features and new data, it's particularly tempting to jump right into the metrics that are new or special for this new feature. However, you should always look at standard metrics first, even if you expect them to change. For example, when adding a new universal block to the page, make sure you understand the impact on standard metrics like “clicks on web results” before diving into the custom metrics about this new result.

Standard metrics are much better validated and more likely to be correct than custom metrics. If your custom metrics don’t make sense with your standard metrics, your custom metrics are likely wrong.

Measure twice, or more

Especially if you are trying to capture a new phenomenon, try to measure the same underlying thing in multiple ways. Then, determine whether these multiple measurements are consistent. By using multiple measurements, you can identify bugs in measurement or logging code, unexpected features of the underlying data, or filtering steps that are important. It’s even better if you can use different data sources for the measurements.

Check for reproducibility

Both slicing and consistency over time are particular examples of checking for reproducibility. If a phenomenon is important and meaningful, you should see it across different user populations and time. But verifying reproducibility means more than performing these two checks. If you are building models of the data, you want those models to be stable across small perturbations in the underlying data. Using different time ranges or random sub-samples of your data will also tell you how reliable/reproducible this model is.

If a model is not reproducible, you are probably not capturing something fundamental about the underlying process that produced the data.
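
A rough sketch of the sub-sample and time-range perturbation idea, using an ordinary least-squares slope as a stand-in for "the model." The file, the `date`, `metric_x`, and `metric_y` columns, and the choice of model are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical modeling data; column names are assumptions.
df = pd.read_csv("training_rows.csv", parse_dates=["date"])

def fit_slope(frame: pd.DataFrame) -> float:
    # Assumed stand-in model: OLS slope of metric_y on metric_x.
    x, y = frame["metric_x"].to_numpy(), frame["metric_y"].to_numpy()
    return np.polyfit(x, y, deg=1)[0]

# Refit on random half-samples and on each month separately.
subsample_slopes = [fit_slope(df.sample(frac=0.5, random_state=s)) for s in range(10)]
monthly_slopes = df.groupby(df["date"].dt.to_period("M")).apply(fit_slope)

print("Sub-sample slopes:", np.round(subsample_slopes, 3))
print("Per-month slopes:\n", monthly_slopes.round(3))
# Large swings across sub-samples or months suggest the model is not
# capturing something stable about the underlying process.
```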

Check for consistency with past measurements

Often you will be calculating a metric that is similar to things that have been counted in the past. You should compare your metrics to metrics reported in the past, even if these measurements are on different user populations.

For example, if you are looking at query traffic on a special population and you measure that the mean page load time is 5 seconds, but past analyses on all users gave a mean page load time of 2 seconds, then you need to investigate. Your number may be right for this population, but now you have to do more work to validate this.

You do not need to get exact agreement, but you should be in the same ballpark. If you are not, assume that you are wrong until you can fully convince yourself. Most surprising data will turn out to be an error, not a fabulous new insight.

New metrics should be applied to old data/features first

If you create new metrics (possibly by gathering a novel data source) and try to learn something new, you won’t know if your new metric is right. With new metrics, you should first apply them to a known feature or data. For example, if you have a new metric for user satisfaction, you should make sure it tells you your best features help satisfaction. If you have a new metric for where users are directing their attention on the page, make sure it matches what we know from looking at eye-tracking or rater studies about how images affect page attention. Doing this provides validation when you then go to learn something new.

Make hypotheses and look for evidence

Typically, data analysis for a complex problem is iterative. [2] You will discover anomalies, trends, or other features of the data. Naturally, you will develop theories to explain this data. Don’t just develop a theory and proclaim it to be true. Look for evidence (inside or outside the data) to confirm/deny this theory. For example:

  • If you see something that looks like a learning trend, see if it manifests most strongly with high-frequency users.
  • If you believe an anomaly is due to the launch of some features, make sure that the population the feature launched to is the only one affected by the anomaly. Alternatively, make sure that the magnitude of the change is consistent with the expectations of the launch.
  • If you see growth rates of users change in a locale, try to find an external source that validates that user-population change rate.

Good data analysis will have a story to tell. To make sure it’s the right story, you need to tell the story to yourself, then look for evidence that it’s wrong. One way of doing this is to ask yourself, “What experiments would I run that would validate/invalidate the story I am telling?” Even if you don’t/can’t do these experiments, it may give you ideas on how to validate with the data that you do have.

The good news is that these theories and possible experiments may lead to new lines of inquiry that transcend trying to learn about any particular feature or data. You then enter the realm of understanding not just this data, but deriving new metrics and techniques for all kinds of future analyses.

Exploratory analysis benefits from end-to-end iteration

When doing exploratory analysis, perform as many iterations of the whole analysis as possible. Typically you will have multiple steps of signal gathering, processing, modeling, etc. If you spend too long getting the very first stage of your initial signals perfect, you are missing out on opportunities to do more iterations in the same amount of time. Further, when you finally look at your data at the end, you may make discoveries that change your direction. Therefore, your initial focus should not be on perfection but on getting something reasonable all the way through. Leave notes for yourself and acknowledge things like filtering steps and unparseable or unusual requests, but don't waste time trying to get rid of them all at the beginning of exploratory analysis.

Watch out for feedback

We typically define various metrics around user success. For example, did users click on a result? If you then feed that data back to the system (which we actually do in a number of places), you create lots of opportunities for evaluation confusion.

You cannot use the metric that is fed back to your system as a basis for evaluating your change. If you show more ads that get more clicks, you cannot use “more clicks” as a basis for deciding that users are happier, even though “more clicks” often means “happier.” Further, you should not even do slicing on the variables that you fed back and manipulated, as that will result in mix shifts that will be difficult or impossible to understand.

Mindset

This section describes how to work with others and communicate insights.

Data analysis starts with questions, not data or a technique

There’s always a motivation to analyze data. Formulating your needs as questions or hypotheses helps ensure that you are gathering the data you should be gathering and that you are thinking about the possible gaps in the data. Of course, the questions you ask should evolve as you look at the data. However, analysis without a question will end up aimless.

Avoid the trap of finding some favorite technique and then only finding the parts of problems that this technique works on. Again, creating clear questions will help you avoid this trap.

Be both skeptic and champion

As you work with data, you must become both the champion of the insights you are gaining and a skeptic of them. You will hopefully find some interesting phenomena in the data you look at. When you detect an interesting phenomenon, ask yourself the following questions:

  • What other data could I gather to show how awesome this is?
  • What could I find that would invalidate this?

Especially in cases where you are doing analysis for someone who really wants a particular answer (for example, "My feature is awesome!"), you must play the skeptic to avoid making errors.

Correlation != Causation

When making theories about data, we often want to assert that "X causes Y"—for example, "the page getting slower caused users to click less." Even xkcd knows that you cannot simply establish causation from correlation. By considering how you would validate a theory of causation, you can usually develop a good sense of how credible a causal theory is.

Sometimes, people try to hold on to a correlation as meaningful by asserting that even if there is no causal relationship between A and B, there must be something underlying the coincidence so that one signal can be a good indicator or proxy for the other. This area is dangerous for multiple hypothesis testing problems; as xkcd also knows, given enough experiments and enough dimensions, some of the signals will align for a specific experiment. This does not imply that the same signals will align in the future, so you have the same obligation to consider a causal theory such as “there is a hidden effect C that causes both A and B” so that you can try to validate how plausible this is.

A data analyst must often navigate these causal questions for the people who want to consume the data. You should be clear with those consumers what you can and cannot say about causality.

Share with peers first, external consumers second

The previous points suggested some ways to get yourself to do the right kinds of soundness checking and validation. But sharing with a peer is one of the best ways to force yourself to do all these things. A skilled peer can provide qualitatively different feedback than the consumers of your data can, especially since consumers generally have an agenda. Peers are useful at multiple points through the analysis. Early on you can find out about gotchas your peer knows about, suggestions for things to measure, and past research in this area. Near the end, peers are very good at pointing out oddities, inconsistencies, or other confusions.

Ideally, you should get feedback from a peer who knows something about the data you are looking at, but even a peer with just general data-analysis experience is extremely valuable.

Expect and accept ignorance and mistakes

There are many limits to what we can learn from data. Nate Silver makes a strong case in The Signal and the Noise that only by admitting the limits of our certainty can we make advances in better prediction. Admitting ignorance is a strength not usually immediately rewarded. It feels bad at the time, but it’s a great benefit to you and your team in the long term. It feels even worse when you make a mistake and discover it later (or even too late!), but proactively owning up to your mistakes earns you respect. That respect translates into credibility and impact.

Closing thoughts

Much of the work to do good data analysis is not immediately apparent to the consumers of your analysis. The fact that you carefully checked population sizes and validated that the effect was consistent across browsers will probably not reach the awareness of the people trying to make decisions from this data. This also explains why good data analysis takes longer than it seems it should to most people (especially when they only see the final output). Part of our job as analysts is to gradually educate consumers of data-based insights on what these steps are and why they are important.

The need for all these manipulations and explorations of your data also lays out the requirements for a good data analysis language and environment. We have many tools available to us to examine data. Different tools and languages are better suited to various techniques discussed above; picking the right tool is an important skill for an analyst. You should not be limited by the capabilities of the tool you are most comfortable with; your job is to provide true insight, not apply a particular tool.

  1. This is sometimes called “initial data analysis.” See the Wikipedia article on data analysis.

  2. Technically, it should only be iterative if you are doing exploratory analysis, not confirmatory analysis.
