“Business is a doggy dog world. And I am a shark, who eats doggy dogs.”
-Michael Scott
Lessons of business and life seem to emanate from the walls of a mid-sized paper-supply company in Scranton, Pennsylvania. Juggling office romances, navigating mergers, and wondering what Creed actually does every day are just some of the responsibilities for the employees of Dunder Mifflin. And when it comes to business operations, Regional President Michael Scott, a seasoned sales executive himself, certainly knows the critical relationship between marketing and sales, for he is in fact the world’s best boss.
Yet for all the camera time and plot lines dedicated to the sales team at Dunder Mifflin, the mention of a single marketing employee — let alone a whole team — is hard to come by. Even the most detailed The Office fan pages are silent on the topic.
Facing an increasingly competitive landscape of one-click e-commerce solutions and big-box retailers offering steep discounts, Dunder Mifflin must find a creative way to drive growth and keep their Scranton branch afloat.
Impromptu customer meetings at Chilis and hands-on sales training just aren’t going to cut it anymore. Instead, Michael and the sales team need answers and they need them now. If only there were a data-savvy tool a marketing team could build to help them…
A sales lead is a person or business that can eventually become a customer. Leads can be obtained through a variety of targeted marketing campaigns and serve as the proverbial fuel for a well-run sales engine.
Dunder Mifflin is no exception; in fact, a season 6 episode emphasized the importance of new leads to the sales team while also giving a glimpse into Erin and Andy’s blossoming romance.
Converting a lead into a customer is easier said than done. It is where the true skill of a salesperson shines. But not all leads must be treated equally. In fact, any marketer looking to drive real business impact should empower his or her colleagues in sales with tools and resources to sift through the noise of sometimes lengthy lead lists and help improve the highly scrutinized lead-to-customer conversion rate.
What if we could use data to help Jim, Dwight, Phyllis, Andy, and the rest of the Dunder Mifflin sales team become more efficient in their outreach? Before even picking up the phone and pitching a prospect on the wonders of groundbreaking paper products, they could know the likelihood of the individual on the other end of the line becoming a customer.
That’s exactly what lead scoring does; it’s an evaluation tool that attempts to rank and “score” leads based on their potential value to an organization.
Figure: Lead Scoring (image by author)
Factors like job title, company location, website visits, and many more can all be combined and weighted to help determine an appropriate score for a lead. An accurate lead scoring model increases sales efficiency by identifying and ranking leads most likely to convert while also helping team members de-prioritize ones that are not.
Such a tool would allow the Scranton branch to cut costs, close more deals, and leave time for more important office pranks or beet farm maintenance.
For this exercise, we’ll simulate a mock dataset of 800 total leads, 200 of which are current Dunder Mifflin customers. While it’s always preferable to use a real-life dataset, my efforts to hack into Dwight’s Salesforce.com account were unsuccessful, so I opted to create a fake dataset (plenty of tools exist for such an exercise) for two primary reasons. First, it seemed like every tutorial or lesson I came across was consistently using the same old “telco customer churn” dataset.
Second and more importantly, the process and methodology outlined below on building a lead scoring model through logistic regression serve as the primary focus and scope of this post.
My hope is that understanding the relationship between the characteristics of a lead and the probability of that lead becoming a customer can be applied to any dataset where logistic regression could unlock business insight…or at the very least get you some well-deserved props at next year’s Dundies.
Our dataset will feature a combination of first-party firmographic (CompanySize, State) and behavioral (WebsiteVisits, EmailOpens) variables that could reasonably be collected via a website form or appended to a record using data enrichment tools commonly found in a modern B2B MarTech stack.
The dependent variable of interest is the binary Customer column, labeled with either a ‘Yes’ or a ‘No’ to indicate if the lead eventually became a customer. We’ll import the particular packages from Python — a programming language commonly used for data analysis — necessary for our efforts and then read the data into a Pandas data frame using the read_csv method.
Doing so will allow us to get a better sense of the data we’re working with, especially if we use the head method to preview the first five rows of our data.
Figure: Previewing the first five observations or the “head” of our leads dataset (Image by author)
A key benefit of tools like Python, R, or even Excel lies in their ability to help analysts quickly develop a better understanding of the data at their disposal. Through proper data analysis, we can start to identify trends or outliers in our data which will help to inform the subsequent steps of building a predictive model.
We can run the describe method which will create summary statistics for each numerical variable and help us understand key statistical features of each including minimum values, mean values, different percentile values, and maximum values. We can quickly find the averages and ranges of certain variables of interest. For example, the average of EmailClicks is roughly half of EmailOpens.
Figure: An output of the .describe method on the data (image by author)
But to answer our main question of “among all leads, what’s the difference between those that become customers and those that do not become customers” we’ll have to use the Pandas groupby method, which allows us to aggregate or split our data to understand it more effectively.
Figure: “Grouping” the numerical variables based on our dependent variable of Customer (image by author)
Here, we’re starting to develop key insights that might inform our predictive model. Upon first glance, it would seem that leads who eventually become customers have a higher number of email clicks/opens and website visits but come from smaller companies when compared to non-converting leads.
But is that really true? Let’s dive deeper into one of these variables, EmailOpens. A box plot analysis is a great tool to help add visual context to the above outputs.
Figure: A boxplot showing the difference in EmailOpens based on the dependent variable (image by author)
The groupby method is also valuable for evaluating categorical variables. LeadSource is one such variable that indicates which particular marketing campaign the lead originated from. This could be essential information not only for the predictive nature of our model, but also to help the Dunder Mifflin marketing team justify which types of campaigns they should allocate more time and budget towards in the future.
A bar chart of the campaigns grouped by the dependent variable of whether or not they’re a customer initially indicates that Adwords (the search engine advertising platform developed by Google also referred to as Google Ads) and Webinar campaigns seem to be more effective in converting leads to customers; however, a key component of campaign costs is missing from this analysis and thus prevents us from drawing any final, definitive conclusions.
Figure: A bar chart comparing the various marketing campaigns in terms of generating leads and converting them to customers (image by author)
Data preparation is like a fire drill or first aid training in an office: you might not enjoy it, but when the time comes, you appreciate its value. Because we’re dealing with a mock dataset, we don’t have to be as wary of outliers, incorrect, or missing data; nevertheless, plenty of steps can be taken to put ourselves in a good position heading into the model-building stages. Simpler models are almost always preferred to their more complex counterparts, so we can drop any irrelevant columns or variables that we know contain no predictive power, like phone number and ID.
(While hypothetically the area code of a phone number could be used to inform the location of the lead, let’s assume that we’re covering that with the State column for this exercise).
An approach of “the simpler the better” can not only apply to feature selection but also to feature engineering.
Assuming a close relationship between the email opens and clicks of a lead and that both variables have a similar impact on the eventual conversion of a lead, we could sum these two variables into an EmailScore column, generating a single index indicating how much a lead is interacting with sales or marketing emails they get from Dunder Mifflin.
This would allow us to replace those two variables with the new EmailScore column which we’ll do using the code below. Moreover, this type of feature engineering helps us remove the risk of multicollinearity when building and evaluating our model.
Figure: The resulting “head” of our dataset after initial data cleanup (image by author)
As we observed from our EDA section, not all variables in our dataset are numeric. For example, LeadSource is a nominal (meaning no inherent order exists among the values) categorical variable indicating which campaign the lead originated from, Adwords, Print, Tradeshow, or Webinar. This is potentially critical information for our lead scoring efforts, so to ensure we can include it in our model we’ll utilize a technique called One-Hot Encoding. This approach creates an array of binary “dummy” variables, one for each possible LeadSource. For example, with the lead seen in row 3 (that originated from ‘Print’) we would have a value of ‘1’ in the new ‘Print’ dummy variable, and a ‘0’ in the other three newly encoded categorical dummy variables. These new dummy columns, which we’ll merge with our original dataset can now be utilized in our predictive model to potentially assess whether any predictive relationship exists between LeadSource and the lead converting into a customer.
Figure: The resulting “head” of our dataset after creating dummy variables for LeadSource (image by author)
We’ll also perform the same steps on the categorical State variable. Lastly, and perhaps most importantly, we need to transform our binary, categorical Customer variable into a numerical variable by swapping a value of 1 for ‘Yes’ and 0 for ‘No’. This can be done with the replace method in Pandas. Now, we’re all set to build a lead scoring model with the appropriate algorithm!
While many approaches exist when it comes to building predictive models, we’ll be utilizing logistic regression, a widely used classification algorithm. Similar to linear regression, logistic regression relies upon an equation where various input values (x) are combined and analyzed using weights or coefficient values to predict an output value (y). However, with logistic regression, the output value of interest is not continuous and is instead a discrete variable. Specifically, with binary logistic regression, the output is dichotomous, meaning only two outcomes (often represented as 0 or 1) are possible.
Logistic regression gives us the ability to model the probability — ranging from 0 to 1 — of what we’re interested in evaluating. Essentially, the outcome of probability p is the probability that our datapoint can be classified as a “1” instead of a “0”. The figure below, a scatterplot of EmailOpens variable mapped to the classifying Customer variable shows how difficult it would be to fit a straight line through our data points. Instead, we’ve added an S-shaped curve that never fully reaches 0 (No) or 1(Yes) to help explain how logistic regression can be a more appropriate approach with our data.
Figure: Scatterplot highlighting the nature of logistic regression (image by author)
For a more concrete example, with housing data we would use linear regression to predict the market value of a house being $500,000; while logistic regression would be more suited to predict a 0.72 percent chance (ranging from 0 to 1) of the house being sold. We won’t spend too much time delving into the statistical elements of logistic regression, but see here, here, and here for some great resources on the topic.
For this exercise, it’s just good to understand that the algorithm is a suitable tool for our efforts of predicting whether or not a lead will become a customer. Logistic regression — also sometimes referred to as logit model — is fairly easy to implement and interpret, which has led to its widespread adoption. However, logistic regression is not always able to handle complex datasets containing a large number of categorical variables. In such scenarios, other classification techniques such as support vector machines, random forests, or neural networks may be more appropriate tools.
We’ll be training a binary logistic regression model, enabling us to estimate the probabilities of a lead becoming a customer for Dunder Mifflin based on multiple factors like company size and website visits. To get started, let’s import the logistic regression model and necessary packages or methods from Scikit-learn (commonly referred to as sklearn), a commonly used machine learning library for Python. We’ll also create our target variable, y (Customer) and input variables (every other column), X from the leads dataframe we’ve prepared.
Let’s now split our data into two sets; a “training” set to train the model, and a “test” set to evaluate the model's performance. This can be done by using the function train_test_split(). Here, the data is being split into two parts, and the test_size parameter of the method allows us to dictate that 80% of the data will be used for model training and 20% for model testing. The size of your training and testing sets can influence the performance of your model, for example, models learn better when they have more training data. However, there’s a risk that they overfit the training data and don’t generalize well to new data, so in order to properly evaluate the model’s ability to generalize, you need enough testing data. 80/20 is a good balance to use for this split. This method returns four variables:
We’ll fit our logistic regression classifier to the training data and labels as it tries to understand the relationship between the two. Next, we’ll have the newly created model make predictions on the test data that it has never seen before. This process helps ensure the external validation of our model. We’ll show the prediction results via a 0 (not a customer) or 1 (customer). The default or standard threshold for a binary classifier model is 0.5, so if the value in the ‘1’ column on the right is greater than 0.5, the observation, or lead will be predicted as becoming a customer; if it is less than 0.5, then it will not become a customer according to the model, resulting in a 0 for that variable.
Figure: The model's predictions on our dependent variable for the first five rows of test data (image by author)
Now, when we compute the model accuracy on the entire test data using the score method from sklearn, we get an output of 0.94. Accuracy, one of the metrics for evaluating classification models, is defined simply as the number of correct predictions divided by the total number of predictions.
While it’s nice to see an accuracy rate of 94%, this doesn’t paint a complete picture. As noted above, our dataset of 800 total leads includes 200 customers and 600 non-customers; this is an example of an imbalanced dataset, where the labels for the two classes on the dependent variable possess significantly different frequencies.
An imbalanced dataset exposes the primary limitation of relying on accuracy as the sole evaluation metric for our model. For example, if we trained a simple model to predict every lead it saw from our dataset as a non-customer, it would have a pretty decent accuracy of 0.75; however, the model would be ignoring and incorrectly predicting all 200 customers as non-customers, thus defeating the purpose of the lead scoring efforts in the first place.
Figure: Output of the above Python code showcasing the imbalanced class of our dataset (figure by author)
Since an imbalanced class renders “accuracy” less useful, we need more nuanced tools like a confusion matrix to properly evaluate our model. The confusion_matrix method takes in two arguments, the actual labels of the test set (y_test) and your predicted labels (y_predict). The output is a 2x2 array describing the performance of the model by showing the frequency of correct and incorrect predictions segmented into four different groups (True Positives, False Positives, True Negatives, and True Positives).
Figure: A Confusion Matrix with descriptions of each quadrant (image by author)
Figure: The output of our model’s Confusion Matrix (image by author)
Some Python code will help convert this array into a more intuitive visualization. From this matrix, we can dig deeper into the errors and performance of our model. Note, the format and arrangement of a confusion matrix is not universal, so be sure to take note of the formatting during interpretation.
Figure: An easier to interpret version of our Model’s Confusion Matrix (image by author)
We can even grab a subset of our test dataset and compare the values between our model’s predictions for the class value with the real value to get a sense of where some errors — indicated by a difference between the two values — might have occurred.
Figure: Another way to view the discrepancies between the model’s predicted value and the true value (image by author)
Let’s define and calculate some additional evaluation metrics we can use from our confusion matrix output to get a better sense of our model’s performance and room for improvement.
Figure: An output of the various evaluation metrics of our model (image by author)
A receiver operating characteristic (ROC) curve is a graph showing the performance of a classification model based on two key parameters; the True Positive Rate or TPR (calculated the same as recall) and the False Positive Rate or FPR. The curve plots TPR vs. FPR at various classification thresholds providing a visual sense of the tradeoff between these two metrics.
The AUC — area under the curve — can help to indicate model performance. Generally, the better the model, the more the curved plotted line should pull to the top left corner of the plot, maximizing the area under the curve. An AUC score of 1 represents a perfect classifier model while a score of 0.5 — shown by the dotted line — represents a random or worthless model. Together with the ROC curve, the AUC allows you to compare and evaluate various classifier models.
Figure: The output of our model’s ROC Curve (in blue) compared to a random guess (image by author)
In binary classification, the threshold, or the cutoff between classifying an observation as a 0 or a 1, is typically set at 0.5. While logical, this default value might not always serve as the best cutoff that leads to the optimal performance of our classifier in the real world; consequently, we have an opportunity to apply business context to improve the model. Recognizing and evaluating the costs associated with Type I and Type II errors from our original model will help inform how we can adjust this classification threshold to minimize costs if we optimize for a particular model performance metric.
As noted previously, recall and precision are far more valuable than looking at accuracy alone; however, these two metrics are often in tension with one another. Improving precision typically reduces recall and vice versa. The metric to optimize for depends on the business case. Dunder Mifflin will want to minimize “false negatives” (a model incorrectly predicting a lead won’t become a customer), to ensure that good leads for the sales team don’t slip through the cracks and get de-prioritized or ignored.
We’d consider the costs of missing out on a potential customer much higher than the associated costs of trying to prospect a non-qualified customer. Therefore, we want to focus on improving the recall of our lead scoring model. To do so, we’ll slightly reduce the classification threshold from its original position in the first model which will classify more leads as customers, thus increasing both false positives and true positives. Instead of the default threshold of 0.5, we will set it to .4, meaning that any lead with a predicted probability of greater than 0.4 will be considered as a potential customer.
While we’re at it, let’s handle that pesky class imbalance issue mentioned above by training our new model with weighted classes. Class_weight is a parameter associated with classification algorithms that we can tap into to ensure we have a balanced mix of each class. By using the ‘balanced’ argument, we can automatically weigh classes inversely proportional to their frequency. Let's see make these changes and see how they impacted the performance metrics of our newly tuned model.
Figure: The evaluation metrics and Confusion Matrix of our tuned model (image by author)
The nice decrease in the costly False Negatives, from seven in our first model to three here, is a great improvement and translates to an increased Recall. Moreover, the tuning of the model had a negligible impact on the already impressive accuracy and F1 scores!
In order to better understand and explain our logistic regression model, we’ll utilize a technique called feature importance that helps quantify the positive or negative impact of the independent variables or features on the prediction for the binary outcome of our dependent customer variable.
For logistic regression, we can extract the coefficient property of each input variable and provide a simplistic yet interpretable feature importance score. At a minimum, we’d expect the feature importance outputs to confirm existing beliefs or hunches the sales team had about the characteristics of a lead. Ideally, feature importance — when combined with business context and domain expertise — can also unlock net new insights about leads for sales team members to pay attention to or prioritize moving forward.
Figure: A chart showing the feature importance breakdown of our updated model (image by author)
The above charts and list showcase the positive impact of being located in the Mid Atlantic states near Dunder Mifflin has on the likelihood of a lead being classified as an eventual customer. What’s interesting is a state such as Indiana had a high coefficient; which could help to inform the launch of more geo-targeted marketing campaigns and even support the decision of where to add a new office branch or hire additional sales reps. With regards to the sources of the leads, Adwords seemed to be the clear winner; on the other hand, Tradeshow had a negative coefficient. These findings, which align with what we noticed in the EDA section earlier in this exercise could help make the case for Dunder Mifflin to pause that campaign and shift some of that marketing budget to the most effective channels.
So we’ve built, tested, and refined the model, now we are ready to put it into action for the good folks of Dunder Mifflin! Proper communication, change management, and training will be critical for driving adoption and delivering true business impact with this lead scoring model. We’ll gather the sales team in the conference room of our choice to present the efforts and findings. But remember, don’t bury the lead (pun intended); simply and succinctly describe how the model will work in practice and explain how it will benefit them and the company. Transparency is key, for those interested in the inner workings of the model and diving into the nitty-gritty, we’ll include the details in an appendix or through an easily accessible slide deck or knowledge article.
While launching the model, have the sales team explain their processes so it can fit seamlessly into their workflow. For example, maybe the marketing team sends a weekly email updating their partners in sales about new assets or upcoming campaigns. We can try including the model’s scoring of that week’s new leads into the email or a similarly scheduled message on Slack. Or perhaps the sales reps swear by their CRM and log in every morning to start their day; if that’s the case, we’ll create a simple report or list in the CRM like the one below that includes the leads with their associated scores and information. To make things even easier and drive adoption, we can filter or sort the leads by the model’s probability score.
Figure: A table of new leads for the sales team ordered by their new Lead Score (image by author)
In turn, we want to create a joint understanding between the marketing and sales teams on how this new tool creates value and supports their success in the long run. We’ll work with sales leaders to identify a critical KPI their department is measured on, such as lead to customer conversion rates or pipeline velocity. Establishing a baseline for these metrics before launching and keeping an eye on how they’ve changed over time when the lead scoring model is up and running will be crucial.
As Dunder Mifflin company evolves, so do their customers; therefore, the lead scoring calculation will require adjustments. We’ll give the sales team members opportunities to provide real-time feedback on the model. Perhaps Jim notices that leads coming in from webinars are not proving to be quality customers after all. That type of feedback from sales and product team members will enable us to refine and eventually improve the model in the long run.
While we covered quite a bit in this tutorial, some items and topics fell just outside of the scope of this post but would be great candidates for a follow-up post. These include:
If you’d like to learn more about logistic regression or check out additional examples of these types of algorithms in action, I’ve included below a list (in no particular order) of helpful articles and tutorials I came across in the process of researching and writing this post:
Check out the full Python code here on Github. What did I forget or what would you have done differently? Any and all feedback is welcome via a comment below or drop me a note at wmc342@gmail.com.
Thanks so much for reading! Hopefully, this high-level overview of lead scoring and logistic regression showcases how the power of analytics can strengthen the crucial relationship between sales and marketing. Now armed with a helpful tool to inform decision making, the good folks at Dunder Mifflin can allocate more time to creating their next big marketing campaign (Andy’s face at the 2:00 mark gets me every time) in order to create some high-scoring leads!
A big thanks to Mick Hammock, Adrienne Raphel, Isaac Dinner, Ian MacDonald, Joanna Kelly, and Susanna Arntz for their edits and insights on this project!
Here’s a collection of definitions for some of the marketing, data science, and The Office terminology used throughout this post:
Lead: A person or business who may eventually become a client/customer.
Lead scoring: A shared sales and marketing methodology used to rank leads based on how likely they are to become a client/customer.
Firmographic data: Information that can be used to categorize businesses, such as geographic area, number of clients, type of organization, industry, technologies used, and so on.
The Dundies: An annual awards show for the Dunder Mifflin Scranton employees, which takes place at the local Chili’s restaurant.
Python: An interpreted, object-oriented, high-level programming language with dynamic semantics.
Boxplot: A method for graphically demonstrating the locality, spread, and skewness groups of numerical data through their quartiles.
Multicollinearity: The occurrence of high intercorrelations among two or more independent variables in a multiple regression model.
One-Hot Encoding: A type of vector representation in which all of the elements in a vector are 0, except for one, which has 1 as its value, where 1 represents a boolean specifying a category of the element.
Logistic regression: A statistical model used to determine if an independent variable has an effect on a binary dependent variable.
Scikit-learn: An open-source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection, model evaluation, and many other utilities.
External validity: How well the outcome of a study can be expected to apply to other settings.
Accuracy: An evaluation metric for classifier algorithms denoting how often the classifier is correct.
Confusion matrix: A tabular summary of the number of correct and incorrect predictions made by a classifier. It is used to measure the performance of a classification model.
Precision: The number of true positive results divided by the number of all positive results, including those not identified correctly.
Recall: The number of true positive results divided by the number of all samples that should have been identified as positive. Also known as sensitivity, this answers how good a classifier is at detecting positive results.
F1 Score: The harmonic mean of precision and recall.
ROC Curve: A graph showing the performance of a classification model at all classification thresholds
Imbalanced dataset: An instance where the distribution of labels across the dataset is not balanced i.e. the distribution is biased or skewed.
Feature importance: Techniques that assign a score to input features on independent variables based on how useful they are at predicting a target, dependent variable.
Pipeline velocity: The speed by which qualified leads move through a sales pipeline
RFM (recency, frequency, monetary value) analysis: A marketing method used to identify the best clients based on their spending habits. It helps predict which customers are likely to buy products again and estimate revenue from regular and new consumers.
Creed Bratton: A TV icon and personal hero of mine.
This article was first published here