*“Learnings of online customer behaviour for Make Benefit Glorious Product Development”*

# Background

At Redbubble, millions of users browse independent artwork in pursuit of that elusive, unique piece of art they want to own on a high-quality product like a t-shirt, cushion cover or phone case. At any point there are lots of ideas for improving the customer experience, most of which are unquantified and backed by unvalidated assumptions. From a product development point of view, this is understandable: all great ideas have to be seeded on some assumptions. However, we strive to validate assumptions so we can get a better idea of impact and opportunity size. The real questions we want to answer are:

- What are the biggest problems?
- What are the best opportunities?
- Are we building the right thing?

There are already plenty of techniques for doing this. User interviews and surveys give us valuable qualitative feedback, but come with their own drawbacks: interpreting user feedback (want vs. need) is difficult, time-consuming and expensive. Quantitative analysis tools like Google Analytics, Flurry, RJMetrics etc. provide a high-level view, which is a valuable starting point, but they fail to isolate the effect of external factors on a particular analysis.

We decided that we wanted to get confident insights from real user actions, draw inferences and quantify the relative impact of specific user visit attributes – all quickly and cheaply. This led us to consider statistical modelling of customer journey data. We had been dabbling with statistical analysis of big consumer datasets, so we decided to take the plunge and see what we would discover.

# Quantifying customer journeys

We have done some statistical analysis before, so we knew that before any study, we have to come up with our strongest hypotheses. This exercise is very important, as it keeps us focussed and honest for the duration of the study. There are lots of articles on how to come up with hypotheses, but at the end of the day, it comes down to hunches backed by some domain knowledge.

## Step 1. Hypothesis Generation

Let’s say we come up with the following hypotheses:

- Hypothesis A: “Users looking at multiple search result pages without clicking through are very engaged and successful at finding artwork of choice”
- Hypothesis B: “Users navigating to a listing from search results are having a good experience and successful at finding artwork of choice”

## Step 2. User Journey Attribution

Now that we have decided on our best hypotheses, let’s work out which user journeys can be attributed to them. Attribution is not meant to be exclusive – a user journey can be attributed to multiple hypotheses. **Now here’s the clever part:** most tools map journeys as sequences of steps or URLs visited on site, which is cumbersome and computationally expensive. We simplified this by building signatures for each journey.

### User journey signature

A user (uuid:10002994877588) visits the home page, then navigates to the page showcasing major product categories, followed by a search, pagination, a click through to a product listing page, an “Add to Cart”, and finally a click back to the search page. If we visualize a web log for just this one user, it looks something like this:

| User ID | Path & query |
| --- | --- |
| 10002994877588 | /home |
| 10002994877588 | /product-categories |
| 10002994877588 | /search?q=term |
| 10002994877588 | /search?q=term&p=2 |
| 10002994877588 | /product-listing?ref=search |
| 10002994877588 | /cart |
| 10002994877588 | /search?q=term&p=2 |

Table 1. Example of server log rows for a user identified by a user ID

Attributing a set of rows to a journey is difficult and computationally expensive, as it involves forward and backward lookups, but attributing a single row is very easy. For this example, if we give a one-letter identifier to each page visit or user interaction, we can easily reduce this user’s visit to a signature.

| User ID | Path & query | Signature element |
| --- | --- | --- |
| 10002994877588 | /home | H |
| 10002994877588 | /product-categories | C |
| 10002994877588 | /search?q=term | S |
| 10002994877588 | /search?q=term&p=2 | P |
| 10002994877588 | /product-listing?ref=search | L |
| 10002994877588 | /cart | A |
| 10002994877588 | /search?q=term&p=2 | P |

Table 2. Each page to be observed in the study is given a single-character identifier

**Signature of visit: HCSPLAP**. That was easy! Now we simply run through our logs and we get a nice one-liner for each user journey.

| UUID | Signature |
| --- | --- |
| 10002994877589 | SSHSPSLLASS |
| 10002994877590 | SSLSSLLL |

Table 3. Example signatures representing user journeys
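
The reduction from log rows to signatures can be sketched in a few lines of Python. The letters come from Table 2; the exact classification rules (e.g. treating a `p` query parameter as pagination) are our assumption for illustration:

```python
from urllib.parse import urlparse, parse_qs

def page_letter(url):
    # Map a logged path & query to its one-letter signature element (Table 2).
    parts = urlparse(url)
    if parts.path == "/home":
        return "H"
    if parts.path == "/product-categories":
        return "C"
    if parts.path == "/search":
        # Assumption: a "p" (page) parameter marks pagination, not a new search.
        return "P" if "p" in parse_qs(parts.query) else "S"
    if parts.path == "/product-listing":
        return "L"
    if parts.path == "/cart":
        return "A"
    return ""  # pages outside the study are ignored

def journey_signature(urls):
    # One pass over a user's rows, no forward or backward lookups needed.
    return "".join(page_letter(u) for u in urls)

visit = [
    "/home",
    "/product-categories",
    "/search?q=term",
    "/search?q=term&p=2",
    "/product-listing?ref=search",
    "/cart",
    "/search?q=term&p=2",
]
print(journey_signature(visit))  # → HCSPLAP
```

Because each row is classified independently, this scales to a simple streaming pass over the logs.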

### Hypothesis signature

Similar to the user journey signatures, we build our hypothesis signatures. This is much quicker and easier, but it requires domain knowledge to craft a signature that mimics the hypothesized user journey accurately.

- Hypothesis A:
`SSPPPSPPSSS`

- Hypothesis B:
`SPLLLSPPL`

Now comes the fun twist: to successfully attribute these signatures to user journey signatures, we transform the hypothesis signatures into regular expressions. Yay for regular expressions! :-)

- Hypothesis A: /(S)\1{0,}(P*)/
- Hypothesis B: /(S)\1{0,}(P*)(L+)/

### Attribution

Now we analyse our reduced dataset to identify which hypothetical journeys are applicable to a particular user journey.

| UUID | Signature | HypothesisA | HypothesisB |
| --- | --- | --- | --- |
| 10002994877589 | SSHSPSLLASS | 3 | 1 |
| 10002994877590 | SSLSSLLL | 2 | 2 |

Table 4. Columns 3 & 4 show the number of times a hypothesis signature matched the user journey

To keep the study a bit open, let’s start by recording counts; later we can reduce them to binary values. As mentioned before, hypothetical journeys are not exclusive – they overlap, and that’s OK.

### Data preparation

Now that we have a sense of direction and a clear idea of how we want to prepare the dataset, should we crank the handle and produce gigabytes worth of data? No. At this point we want to focus on data quality, which means identifying and removing bad rows and looking for outliers. So let’s start with a small sample – about 10,000 rows. Once we have a repeatable way of identifying and removing bad rows and outliers, we can run it over a bigger dataset and draw inferences with more confidence.
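
A cleaning pass over the small sample might look like the sketch below. The rules (empty signatures are bad rows; a length cutoff flags outliers) and the cutoff value are assumptions – in practice you would pick them after inspecting the sample:

```python
# Assumed outlier cutoff; choose it after visualizing the sample.
MAX_LENGTH = 50

def clean(rows):
    # Keep (uuid, signature) pairs that have at least one observed page
    # and a plausible journey length.
    return [
        (uuid, sig) for uuid, sig in rows
        if sig and len(sig) <= MAX_LENGTH
    ]

sample = [
    ("10002994877589", "SSHSPSLLASS"),
    ("10002994877590", "SSLSSLLL"),
    ("10002994877591", ""),         # bad row: no observed pages
    ("10002994877592", "S" * 500),  # outlier: implausibly long journey
]
print(len(clean(sample)))  # → 2
```

The point is repeatability: once `clean` works on 10,000 rows, the same function runs unchanged over the full dataset.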

### Data Visualization

Always visualize your data. It helps you to find anomalies and pick up obvious trends in the data. A simple histogram will tell you a lot about how your data is distributed, which can help in modelling it better. Scatter plots are good for identifying outliers.

# Deriving Relationships

Coming back to the primary intent of this study, we want to find any relationship that may exist between the hypothetical user journeys and our definition of success. Before we do that, we should shed some light on the subtleties of the task ahead. Firstly, our definition of success should allow us to quantify success with confidence. It should be something concrete: either an actual business success metric or something that correlates strongly with one. Secondly, when looking at the relationship itself, we are not interested in predicting the success metric from the hypothetical journey (will this journey result in success?); rather, we want to understand the relationship and infer from it (which journeys tend to be successful?). This points us to the correct statistical approach: explanatory modelling.

## Defining success, a.k.a. the dependent variable

To be confident about any relationships we discover, we need to quantify our definition of success. In this case, we deem an “Add to Cart” event the signal that the customer succeeded in their journey. A transaction is a more “real” success metric, but in the e-commerce world, “Add to Cart” is a good proxy, since it correlates strongly with transactions. It also lets us use shorter journeys to describe our scenario, which simplifies our data collection. This could be a binary value, but we’ll start with the count.
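
Conveniently, this dependent variable can be read straight off the journey signature: given the encoding in Table 2, every “A” is an “Add to Cart” event (the helper name is ours):

```python
def add_to_cart_count(signature):
    # Each "A" element in the signature is an Add to Cart event (Table 2).
    return signature.count("A")

print(add_to_cart_count("SSHSPSLLASS"))  # → 1
print(add_to_cart_count("SSLSSLLL"))     # → 0
```

Reducing the count to a binary success flag later is then just `count > 0`.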

## Analysis

We have prepared our dataset with a row for each customer journey, its attributed hypotheses and its success flag. Now we can run our regression analysis. Regression analysis explains outcomes well, so it is well suited to situations like this, where the intent is to draw inferences.

First, we choose a regression model. You can start with simple linear regression to surface-scan the data and then proceed to other models like Poisson, logit etc., depending on the nature of your dataset. There is plenty of excellent literature on choosing the correct model for your dataset.

In our case we apply a logistic regression model on our dataset. Using R, it can be done with something as simple as:

```r
glm(
  formula = addToCart ~ HypothesisA + HypothesisB,
  family = binomial("logit"),
  data = data.df
)
```

The output can be confusing, but for the sake of inference we should focus on:

- The direction of the slope, shown by the sign (positive or negative) of the Estimate.
- The significance. Ignore anything that is not significant, and accept that there may be no result.

In our case:

| Independent variable | Estimate | Significance |
| --- | --- | --- |
| HypothesisA | -0.28925 | *** |
| HypothesisB | 0.34065 | ** |

Table 5. Output from the regression analysis
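
A standard way to make logistic-regression estimates tangible is to exponentiate them into odds ratios (this interpretation step is ours, not part of the R output above):

```python
import math

# Logit estimates from Table 5.
estimate_a = -0.28925
estimate_b = 0.34065

# exp(estimate) is the multiplicative change in the odds of an Add to Cart
# per additional match of the hypothesis signature.
print(round(math.exp(estimate_a), 3))  # below 1: odds decrease
print(round(math.exp(estimate_b), 3))  # above 1: odds increase
```

An odds ratio below 1 for Hypothesis A and above 1 for Hypothesis B is exactly the direction-of-slope reading described above.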

Now we are in a position to say confidently that:

- **Hypothesis A is invalid.** We see a negative slope, which shows that customers paging through search results without clicking through to a listing are less likely to be successful at finding artwork they like.
- **Hypothesis B is valid.** Customers going from a search result page to a listing are more likely to be successful at finding artwork they like.

# Conclusion

We explained how we can quantify customer journeys, attribute them to our hypotheses and validate those hypotheses by applying efficient statistical models. This lets us make inferences quickly, meaning we can shorten our feedback loop and be more responsive to the data.

Now we can confidently identify which user segments present a real opportunity for improvement. Further, we can quantify the size of the opportunity simply by counting how many customers fall into a given segment. Knowing problem size helps with prioritization. These steps are fundamental to building out a data-driven product portfolio and identifying our biggest opportunities.