Predicting Doctor Who Episode Ratings

Mar 5, 2018 · 2056 words · 10 minutes read ML Python

I don’t like Daleks. I was never the biggest Doctor Who fan, but I watched and enjoyed the David Tennant seasons back in the day. Except for the Dalek parts – I never enjoyed those.

When a friend of mine visited Cardiff in May, I was reminded of my feelings towards the cone-shaped creatures. I had assumed everyone shared my thoughts on them, but asking around revealed several Dalek-enthusiasts in my group of friends. Intrigued, I turned to IMDB for a more quantitative assessment of people’s attitudes towards the Daleks.

Daleks in their natural habitat. Photo from Nerdist

Daleks in their natural habitat. Photo from Nerdist

Data Collection

The Python module IMDbPY allows for easy retrieval of data from IMDb. To get all of the plot summaries and episode ratings for Doctor Who, we need to first search their database and identify the relevant record.

import pandas as pd
from imdb import IMDb

def get_top_tv_show(search_results):
    """ Return top search result which is a TV show """
    top_result = None
    for i in range(0, len(search_results) - 1):
        if 'tv series' == search_results[i]['kind']:
            top_result = search_results[i]
    return top_result

# initialize database
db = IMDb()

search_results = db.search_movie('Doctor Who')
top_result = get_top_tv_show(search_results)

By default, IMDbPY search results only contain the title and year of the listing, as well as what kind it is (movie, TV show, etc.). To get information on episode ratings and descriptions, we need to download additional “info packets.” Afterwards we can loop through the episodes, keeping track of the rating and episode description.

# get more info
db.update(top_result, ['main', 'episodes'])

season_numbers, episode_numbers, titles, plots, ratings = [], [], [], [], []

for season_number, season_details in top_result['episodes'].items():
    for episode_number, episode_details in season_details.items():

        # only keep the ones with a rating
        if not 'rating' in episode_details.keys():
        # MISC episode details
        titles.append(episode_details['title'].replace('\n', ''))
        plots.append(episode_details['plot'].replace('\n', ''))
        # rating

# make pandas data frame
dic = {
    'season': season_numbers, 
    'episode': episode_numbers, 
    'title': titles, 
    'plot': plots, 
    'rating': ratings
episode_data = pd.DataFrame(dic)

This step left me with a data frame of episode title, plot description, and rating that I could use for all downstream analyses.

Testing the Dalek Hypothesis

For easier statistical analysis and data visualization, I imported the processed data into R. While I have some experience with matplotlib and seaborn, it pales in comparison to the hours I have spent tweaking obscure plot parameters in R.

To test my Dalek hypothesis, I wanted to see if there was a significant difference in episode ratings between the episodes whose plot summary mentioned the word “Dalek” and the rest. The Mann-Whitney U-test is a non-parametric test for differences in population means1, and it’s implemented in the R function wilcox.test(). “Non-parametric” means that we don’t have to assume a particular distribution for the ratings. That’s a good thing – they’re bounded between 1 and 10 and definitely not normally distributed.

The boxplot shows the average rating of episodes that include the word “Dalek” in the description vs. others, along with the Mann-Whitney \(p\)-value. Refuting my original hypothesis, Dalek episodes are significantly higher rated than episodes without Daleks!

We can extend the analysis to look for other words whose episodes have higher or lower ratings than average. Julia Silge’s tidyverse package is an easy way to split the text strings to a long form bag of words model.

I ranked the words by the median rating of their episodes and plotted the top and bottom 10 words2. The dashed line shows the overall median, and each dot is coloured by its season. Dalek is ranked as the fifth best word, further evidence that most people are quite fond of them.

Ratings by word
The second and third highest rated words are “River” and “Song”. Google revealed to me that River Song is a character on the show, presumably accounting for the similar rating profile of the two words.

Predictive Model

Having done all of the data processing steps, I wanted to get more mileage out of my data by building a predictive model for the episode rating. For this somewhat frivolous problem, I took the opportunity to learn more about machine learning method I didn’t know much about: gradient boosting machines

Introduction to Gradient Boosting Machines

Like the name suggests, gradient boosting machines are a boosting method for regression and classification problems. A collection of weak classifiers \(G_1, ..., G_m\) are combined to form a prediction

\[\begin{equation} G(x) = \sum_{m=1}^M \alpha_m G_m(x) \end{equation}\]

The \(M\) classifiers are trained sequentially, and at each stage of the algorithm, the points that were the most difficult to classify in the previous stage are given more weight.

One approach to weighing the data points by classifier performance is to fit the classifier at stage \(m\) to the residuals from the classifier from stage \(m-1\). This ensures that points with large residuals (i.e. points the classifier is struggling with) are given more weight, with the exact weighting being determined by the loss function.

The gradient boosting algorithm implements this approach. At each stage \(m\), a decision tree is trained on the gradient of the loss function of the previous classifier, \(f_{(m-1)}\), evaluated at each training point \(x_i\). The prediction is then updated to \[\begin{equation} f_m(x) = f_{(m-1)}(x) + \sum_{j=1}^{J_m} \gamma_{jm}I(x \in R_{jm}) \end{equation}\]

where \(R_{jm}\) are the partition regions and \(\gamma_{jm}\) are the values at the terminal nodes of the loss gradient tree we trained at stage \(m\).

For a lot of problems, this algorithm will fully learn the training data fairly quickly. To avoid overfitting, we restrict all of the base learner trees to the same size \(J\) (typically between 2 and 8), and introduce a new parameter \(\nu\) to control the learning rate.

\[\begin{equation} f_m(x) = f_{(m-1)}(x) + \nu \sum_{j=1}^{J} \gamma_{jm}I(x \in R_{jm}) \end{equation}\]

Another hyperparameter we are interested in tuning is the number of iterations of the algorithm, \(M\). More iterations will provide a closer fit to the data, at the expense of higher computational requirements and the risk of overfitting. \(M\) and \(\nu\) are intimately related. If we set a lower learning rate, more iterations will be required to achieve the same performance on the training data.

Implementation in scikit-learn

To fit the gradient boosting machine to the Doctor Who dataset, I returned to Python and scikit-learn. Gradient boosting for regression is implemented in the GradientBoostingRegressor() function of the sklearn.ensemble module.

Before training the model we need to convert the text strings to a bag-of-words representation. I wrote a function tokenize_episodes() that converts a pandas series of strings to a matrix of indicator variables.

def tokenize_episodes(episode_plots, threshold = 3, remove_stop_words = True):
    Convert episode plots to token representation

    :param episode_plots: pandas series of episode plots
    :param threshold: recurrence threshold for word to be included in bag-of-words model
    :param remove_stop_words: boolean indicating if stop words should be removed.

    :return: pandas data frame where each row is an episode and each column is a word

    features = get_all_features(
        threshold = threshold, 
        remove_stop_words = remove_stop_words

    # convert plots to indicator variables for each feature
    episode_tokens = []

    for plot in episode_plots:
        feature_indicators = make_indicators(plot, features)

    return episode_tokens, features

episode_tokens, features = tokenize_episodes(episode_data['plot'], threshold = 0)

X = pd.DataFrame(episode_tokens, columns = features.keys(), index = episode_data['title'])
Y = episode_data['rating']

Here get_all_features() and make_indicators() are helper functions that return a vector of all words that should be included in the model and an indicator vector for a string, respectively3.

I wanted to focus on how the number of iterations affects training and test set performance and was less interested in the other hyperparameters. I used the least absolute deviations (LAD) loss4 and set the max depth of the trees and the learning rate to their defaults of 3 and 0.1, respectively.

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, 
    test_size = 0.3, 
    random_state = 7
model = GradientBoostingRegressor(
    loss = 'lad',
    n_estimators = 1000, 
    random_state = 7
    ), Y_train);

To monitor how predictions change when we add more iterations to the gradient boosting algorithm, we can use the staged_predict() function. It returns a generator object, and we need to iterate over it to access the predictions made at each stage.

train_pred = []
for pred in islice(model.staged_predict(X_train), 0, 1000):

I ran this for both the training set and the test set and plotted the results. As expected, the training error keeps decreasing as we add more weak learners, while the test error starts increasing after about 100 iterations.

Training and test error
Unfortunately, even the lowest test error rate of 0.562 isn’t that impressive. If we hadn’t bothered with the gradient boosting and just used the mean episode rating of the training data as a constant prediction, we would have achieved a mean absolute error (MAE) of 0.597 (dashed line). In other words, on average the predictions from our gradient boosting regression model is only 0.035 units closer to the true rating than a constant prediction based on the mean rating would be!

To assess whether the poor performance was a problem with the gradient boosting machine or a feature of the regression task itself, I turned to h2o. Their automated machine learning framework allows users to quickly test a large number of methods and is implemented in both Python and R.5

import h2o
from h2o.automl import H2OAutoML

# start h2o instance

# convert to AutoML format - need predictors and response in same object
train = h2o.H2OFrame( X_train.assign(rating = Y_train.values) )
test = h2o.H2OFrame( X_test.assign(rating = Y_test.values) )

# get names of predictor columns
predictors = train.columns;

# fit model
who = H2OAutoML(max_runtime_secs = 30)
    x = predictors,
    y = 'rating',
    training_frame = train,
    leaderboard_frame = test

After running the H20AutoML() procedure, we can compare the performance of different methods by looking at who.leaderboard. As it turns out, a gradient boosting machine achieves the best performance. While most of the methods are able to beat the baseline MAE of 0.597, the improvements are minimal. It would seem that there’s simply not enough information in the Doctor Who episode description to make accurate predictions about the rating.

model_id                                                 mean_residual_deviance      rmse       mae     rmsle
-----------------------------------------------------  ------------------------  --------  --------  --------
GBM_grid_0_AutoML_20180304_203855_model_3                              0.446808  0.668437  0.558715  0.071806
GLM_grid_0_AutoML_20180304_203855_model_0                              0.452206  0.672462  0.579422  0.0724
GBM_grid_0_AutoML_20180304_203855_model_1                              0.453604  0.673501  0.563871  0.072264
StackedEnsemble_BestOfFamily_0_AutoML_20180304_203855                  0.46003   0.678255  0.57982   0.072966
StackedEnsemble_AllModels_0_AutoML_20180304_203855                     0.46005   0.67827   0.579665  0.072964
GBM_grid_0_AutoML_20180304_203855_model_2                              0.460882  0.678883  0.56739   0.072785
DRF_0_AutoML_20180304_203855                                           0.482175  0.694388  0.614219  0.074378
XRT_0_AutoML_20180304_203855                                           0.482868  0.694887  0.595714  0.074741
DeepLearning_0_AutoML_20180304_203855                                  0.492726  0.701944  0.586139  0.075154
GBM_grid_0_AutoML_20180304_203855_model_0                              0.493531  0.702518  0.580334  0.075823

Testing in New Data

When I first started working on this in May, only the first four episodes of season 10 had aired. Even though my final model turned out to be a bit of a fiasco, I decided to continue with my original plan of using the remaining eight episodes as a test set.

The results are shown below. As expected, the gradient boosted machine predictions are pretty far off the actual ratings. In 5/8 cases, the constant mean predictor of 8.16 turns out to be closer to the observed episode rating than the GBM prediction. The result is a GBM mean absolute error of 0.84 against a constant prediction MAE of 0.68.

All in all, there does not seem to be much value in using the episode description to predict the rating. Alternative approaches could be to model the autocorrelation or “seasonality” in ratings. Without having looked at any data, I would assume that season finales tend to have higher ratings than mid-season episodes. Using the episode number as a predictor might allow us to improve on the constant mean prediction.


If you have higher hopes for the predictive power of episode descriptions in a different TV show, all of the source code is available on Github.


Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Springer, 2009.

DataRobot Blog. Gradient Boosted Regression Trees.

  1. Technically speaking it’s a test for differences in location, not means.

  2. Due to the large multiple testing burden I didn’t test for significance of the individual words. With 65 words that occur in at least 5 episode descriptions and only 129 epsiodes, we’re unlikely to have good power.

  3. See the Github repository for full source code.

  4. By considering absolute value rather than squared error (such as in least squares), we give less weight to observations with large residuals.

  5. For more details on h2o, check out Kasia Kulma and Matt Dancho’s blog posts!