From None to Severe: Predicting Severity in Movie Scripts
Yigeng Zhang, Mahsa Shafaei, Fabio A. González and Thamar Solorio
University of Houston
Universidad Nacional de Colombia
{yzhang168,mshafaei,tsolorio}@uh.edu
Abstract
In this paper, we introduce the task of predicting the severity of age-restricted aspects of movie content based solely on the dialogue script. We first investigate categorizing the ordinal severity of movies on 5 aspects: Sex, Violence, Profanity, Substance consumption, and Frightening scenes. The problem is handled using a Siamese network-based multitask framework which concurrently improves the interpretability of the predictions. The experimental results show that our method outperforms the previous state-of-the-art model and provides useful information to interpret model predictions. The proposed dataset and source code are publicly available at our GitHub repository.¹
1 Introduction
Estimating the severity of objectionable content in movies makes it easier for viewers to judge whether a movie is suitable to watch. For example, parents may want to make sure there are no violent scenes in a movie they plan to watch with their kids, because exposure to violent scenes may increase youth aggressive behavior and decrease empathy (Anderson et al., 2017). However, existing rating systems (e.g., MPAA) only provide simple age restrictions and do not indicate the suitability level for specific aspects of the content. Furthermore, a system that automatically tracks the severity of objectionable content helps creative professionals evaluate the age suitability of their work: they can adjust the creation or marketing strategy of a product based on the corresponding target audiences. Such a system can be readily applied to any dialogue-intensive composition, such as novel and screenplay writing, and content evaluation and intervention by the writers can happen at any stage of production to assess age-restricted content.
¹ https://github.com/RiTUAL-UH/Predicting-Severity-in-Movie-Scripts
In this work, we propose to solve the problem of predicting the severity of age-restricted content using only the dialogue script. Text is much more lightweight than visual data (such as images and videos), so processing can be more efficient and scalable given the increasing fidelity of multimedia content. We initiate our exploration on movies along five content aspects: Sex & Nudity, Violence & Gore, Profanity, Alcohol, Drugs & Smoking, and Frightening & Intense Scenes, as used in the IMDB² Parents Guide.
A small number of previous works have studied modeling age-restricted content. Shafaei et al. (2020) initiated the research of predicting MPAA ratings of movies leveraging the movie script and metadata. Martinez et al. (2019) focused on violence detection using movie scripts, while Martinez et al. (2020) expanded the scope to violence, substance abuse, and sex. Both works predicted the severity of age-restricted content at three manually defined levels: low, mid, and high. In this work, we introduce two more aspects of interest: Frightening and Profanity. Instead of manually collapsing severity into three categories, we explore a more challenging setting: rating on the four fine-grained severity levels originally defined by collective user ratings: None, Mild, Moderate, and Severe.
The major contributions of our research can be summarized as follows: (1) This work is the first attempt to solve the problem of predicting the severity of age-restricted content across 5 aspects. We study multiple baselines and present a competitive method to inspire future exploration. (2) To the best of our knowledge, the dataset we developed is the first publicly available dataset for this task; it is roughly five times larger than the restricted datasets from previous works. (3) We propose an effective multitask ranking-classification framework to solve this problem. Our method handles long movie scripts successfully and achieves state-of-the-art results on all five aspects of age-restricted content with rich interpretability.

² https://www.imdb.com/, one of the most visited online databases of film, TV, and celebrity content.
2 Methodology
We investigate predicting the severity of 5 objectionable aspects of movies. The problem is formulated as a multi-class classification task. The average length of the dialogue scripts is around 10,000 words, which drastically exceeds the input length limit of current popular Transformer-based models. To leverage the strong semantic representation capability of Transformers, we propose to treat each utterance as the basic unit and then encode the context with recurrent modules. Finally, we use a fully connected layer on top of the encoded representations to produce the classification predictions.
For this model, we first leverage SentenceTrans-
formers (Reimers and Gurevych, 2019) to encode
each dialogue utterance. Then, a Bi-directional
LSTM encoder is deployed to model the sequen-
tial interrelations of the utterance flow. We finally
apply a max-pooling operation on all time steps of
the hidden states of the recurrent module to get the
document representation for classification follow-
ing the practice in (Howard and Ruder, 2018). We
also study another strong word-level deep learning
model, TextRCNN (Lai et al., 2015), to probe the
significance of lexical signals.
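To make the backbone concrete, the following is a minimal PyTorch sketch of this utterance-level pipeline: SentenceTransformer utterance embeddings, a Bi-LSTM, max-pooling over all time steps, and a linear classifier. The class name, the specific sentence-encoder checkpoint, and the hyperparameter defaults are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of an utterance-level backbone in the spirit of RNN-Trans:
# sentence embeddings -> Bi-LSTM -> max-pool over time -> linear classifier.
# Names and hyperparameters are illustrative, not the paper's released code.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

class UtteranceLevelClassifier(nn.Module):
    def __init__(self, sent_dim=768, hidden=200, num_classes=4):
        super().__init__()
        self.bilstm = nn.LSTM(sent_dim, hidden, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_classes)

    def forward(self, utt_embeddings):            # (batch, n_utterances, sent_dim)
        hidden_states, _ = self.bilstm(utt_embeddings)
        doc_repr, _ = hidden_states.max(dim=1)    # max-pool over all time steps
        return self.classifier(doc_repr)          # (batch, num_classes) logits

# Usage: encode each dialogue utterance once, then classify the whole script.
encoder = SentenceTransformer("all-mpnet-base-v2")   # any 768-d sentence encoder (assumed checkpoint)
utterances = ["Where were you last night?", "I was at the bar."]
emb = torch.tensor(encoder.encode(utterances)).unsqueeze(0)   # (1, n_utterances, 768)
logits = UtteranceLevelClassifier()(emb)
```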
2.1 The Multitask Ranking-Classification Framework
The severity of a particular type of age-restricted content is a relative concept. People assign severity ratings to movies based on their own experiences and personal beliefs. Meanwhile, the severity levels are ordinal variables rather than independent categorical classes. Therefore, customers can gain a vivid understanding of the severity levels of an unfamiliar movie by comparing it to some examples (e.g., previously watched ones).
The general model development procedure is described in Algorithm 1. We assume that learning to compare movies on their severity is a proxy for understanding how the model differentiates severity, so we propose a pairwise ranking objective (Hüllermeier et al., 2008) as an auxiliary task to probe into the model behavior for more interpretable predictions.
Figure 1: Multitask pairwise ranking-classification network. Two tied-weight encoder and pooling branches process a pair of movies A and B; each branch feeds a shared classifier (None / Mild / Moderate / Severe), and the concatenated pair representation feeds a ranker (Less / Equal / More). The model is trained on the joint loss l_c + l_r.

In contrast to existing multitask approaches for text classification (Liu et al., 2016;
Zhang et al., 2017), this framework is based on
a Siamese network with tied weights for both instance classification and comparison, and it can be adopted with any backbone encoder. Here we apply it to the Bi-LSTM + Transformers model (RNN-Trans) and the TextRCNN model. The classification and ranking objectives are trained with two individual cross-entropy losses, and the model is optimized on the joint loss of classification and ranking. The model structure is illustrated in Figure 1.
Algorithm 1: Ranking-Classification

Input: training instance set X^t with severity label set Y^t; ranking-classification model f; comparison operation cpr; classification/ranking losses L_c / L_r.
Output: multitask ranking-classification model f̂.

Function RANK-CLS(f, X^t, Y^t):
    initialization
    while not stopping criteria do
        randomly pick x_i^t, x_j^t ∈ X^t with corresponding y_i^t, y_j^t ∈ Y^t
        c_{i,j}, r_{ij} ← f(x_i^t, x_j^t)
        l_c ← L_c(c_{i,j}, y_{i,j}^t)
        l_r ← L_r(r_{ij}, cpr(y_i^t, y_j^t))
        f̂ ← argmin_f (l_c + l_r)
    end
    return f̂
Method                       Sex     Violence  Profanity  Substance  Frightening  Avg
Baseline models
  LR + GloVe Avg.            33.87   46.35     48.06      29.27      41.38        39.79
  SVM + GloVe Avg.           27.48   41.88     44.16      18.68      35.42        33.52
  TextCNN                    38.19   46.61     63.95      37.16      44.82        46.14
  BERT                       33.73   36.29     49.48      34.58      37.75        38.37
Backbone models
  TextRCNN                   43.13   52.51     66.49      41.79      49.74        50.73
  RNN-Trans                  44.76   55.72     62.32      42.39      50.95        51.23
Proposed multitask models
  TextRCNN S-MT              43.01   52.59     67.26      43.92      50.36        51.43
  RNN-Trans S-MT             45.68   55.90     62.65      42.21      51.55        51.60

Table 1: Severity prediction macro F1 scores on test data.
The auxiliary ranking component learns to distinguish the severity difference within training pairs, with the objective of predicting lower/equal/higher severity between the two instances. By introducing this function, we can apply the model f to compare the severity level of any pair of movies for a given aspect.
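The sketch below illustrates how the joint objective of Algorithm 1 could be implemented in PyTorch, assuming a shared backbone that returns a pooled document representation (the layer before the classifier in the earlier sketch). The names SiameseRankClassifier, cpr, and train_step, as well as the exact form of the classification loss on the pair, are our own assumptions rather than the authors' implementation.

```python
# Minimal sketch of one joint ranking-classification step (cf. Algorithm 1).
# `backbone` is any shared encoder producing a document representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseRankClassifier(nn.Module):
    def __init__(self, backbone, repr_dim, num_classes=4):
        super().__init__()
        self.backbone = backbone                      # tied weights: one shared encoder
        self.classifier = nn.Linear(repr_dim, num_classes)
        self.ranker = nn.Linear(2 * repr_dim, 3)      # less / equal / more

    def forward(self, x_a, x_b):
        h_a, h_b = self.backbone(x_a), self.backbone(x_b)
        c_a, c_b = self.classifier(h_a), self.classifier(h_b)
        r_ab = self.ranker(torch.cat([h_a, h_b], dim=-1))   # aggregation by concatenation
        return c_a, c_b, r_ab

def cpr(y_a, y_b):
    """Comparison operation: 0 = less, 1 = equal, 2 = more severe."""
    return (y_a > y_b).long() - (y_a < y_b).long() + 1

def train_step(model, optimizer, x_a, y_a, x_b, y_b):
    c_a, c_b, r_ab = model(x_a, x_b)
    l_c = F.cross_entropy(c_a, y_a) + F.cross_entropy(c_b, y_b)  # classification loss l_c
    l_r = F.cross_entropy(r_ab, cpr(y_a, y_b))                   # ranking loss l_r
    loss = l_c + l_r                                             # joint loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```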
3 Dataset
The dataset used in this work was developed based on the script data used in (Shafaei et al., 2019, 2020). We collected the up-to-date user ratings for age-restricted content from IMDB.com for more than 15,000 movies. The age-restricted aspects are adopted from the Parents Guide section of each movie, and there are five aspects: Sex & Nudity, Violence & Gore, Profanity, Alcohol, Drugs & Smoking, and Frightening & Intense Scenes. Each aspect has four severity levels, from low to high, on which users rate the corresponding movies: None, Mild, Moderate, and Severe. In this work, we use the rating shown on the website as the category label for each aspect.
After collecting the user ratings, we filter out movies with fewer than 5 votes to make sure the collected severity levels are reliable. In the end, we have roughly 4,400 to 6,600 movies for each aspect.
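As a hypothetical illustration of this filtering step, the snippet below assumes the collected ratings are stored in a CSV with one row per (movie, aspect) pair and columns num_votes, aspect, and severity; the file name and schema are our own, not from the paper.

```python
# Hypothetical sketch of the vote-count filter described above.
import pandas as pd

ratings = pd.read_csv("imdb_parents_guide_ratings.csv")   # assumed file/schema
reliable = ratings[ratings["num_votes"] >= 5]              # drop aspects with < 5 votes
per_aspect = {aspect: df for aspect, df in reliable.groupby("aspect")}
print({aspect: len(df) for aspect, df in per_aspect.items()})   # ~4,400-6,600 movies each
```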
The movie scripts have a median/average length
of around 10,000 words. The vocabulary size of
each aspect is roughly 330,000 to 450,000. De-
tailed dataset descriptions are attached in the ap-
pendix.
Compared to the previous works with restricted data access (Martinez et al., 2019, 2020), our dataset is roughly five times larger, and the data source is freely accessible. We make the updated dataset publicly available, together with the data partitions used in this work, for reproducibility purposes.
Method                     Sex.    Vio.    Pro.    Sub.    Fri.
(Shafaei et al., 2020)     29.21   36.65   50.57   33.48   27.82
(Martinez et al., 2020)    40.91   53.02   60.51   35.60   48.81
TextRCNN S-MT              41.27   54.11   69.51   43.56   47.18
RNN-Trans S-MT             44.66   55.29   64.01   42.63   51.03

Table 2: Performance benchmarking against related approaches under the same 10-fold cross-validation.
4 Experimental Results
We evaluate the effectiveness of different methods using the macro F1 score because the data is imbalanced. The dataset is first split and stratified using
an 80/10/10 ratio for training, development, and
test for each age-restricted aspect. The experimen-
tal results on the test set are reported in Table 1.
Baseline models include average GloVe embed-
ding (Pennington et al., 2014) with SVM/Logistic
Regression, TextCNN (Kim, 2014), and BERT (De-
vlin et al., 2019). The proposed multitask model outperforms multiple baselines by a compelling margin in all aspects. A statistical significance test shows that introducing the ranking subtask does not harm model performance.
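For reference, the evaluation setup described above (a stratified 80/10/10 split and macro F1) can be sketched with scikit-learn as follows; the function names and random seed are illustrative assumptions.

```python
# Minimal sketch of the stratified 80/10/10 split and macro F1 evaluation.
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def stratified_split(texts, labels, seed=42):
    x_train, x_rest, y_train, y_rest = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed)
    x_dev, x_test, y_dev, y_test = train_test_split(
        x_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=seed)
    return (x_train, y_train), (x_dev, y_dev), (x_test, y_test)

# Macro F1 weighs the four severity classes equally despite the label imbalance.
def evaluate(y_true, y_pred):
    return f1_score(y_true, y_pred, average="macro")
```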
The proposed RNN-Trans multitask Siamese model (RNN-Trans S-MT) dominates on Sex & Nudity, Violence & Gore, and Frightening & Intense Scenes, while the proposed TextRCNN multitask Siamese model (TextRCNN S-MT) works best on Profanity and Alcohol, Drugs & Smoking. This outcome is intuitive because Profanity is the only aspect with an overt pattern, such as bad words and abusive language, included in the dialogue script. Such utterances are neither missing nor latent, either to audiences or to the NLP model, so word-level models have an advantage over the utterance-level model in catching these signals. Frightening & Intense Scenes and Violence & Gore are more challenging for the model to infer because the data lacks visual and audio information. Dialogue might sometimes imply scenes such as a violent fight with cursing
Movie                         Aspect        Gold label  Prediction  Correct?
Deadpool (2016)               Profanity     Severe      Severe      yes
Pride & Prejudice (2005)      Violence      None        None        yes
Django Unchained (2012)       Sex           Moderate    Mild        no
A Clockwork Orange (1971)     Substance     Moderate    Mild        no
The Greatest Showman (2017)   Frightening   Mild        Mild        yes

Table 3: Analysis of successful/unsuccessful test examples; the Correct? column marks whether the prediction matches the gold label. In the full table, each test movie is also compared against the candidate comparators at each severity level (None, Mild, Moderate, Severe): a black circle indicates the model believes the test sample has a higher severity level than the comparator, a gray circle means equal severity, and a light gray circle means lower severity. The comparison results come from the best-performing model of each aspect.
words or a frightening shot with screaming; however, not all such scenes come with distinctive dialogue. For Sex & Nudity and Alcohol, Drugs & Smoking, it is even more difficult to infer severity from plain text without any visual evidence.
We also experiment with models from the related previous works of Shafaei et al. (2020) and Martinez et al. (2020). The former leverages an LSTM with attention and NRC emotion features to model the textual signal and predict MPAA ratings for a movie from its script; the latter uses an RNN over dialogue encoded by a pretrained MovieBERT model together with sentiment embeddings to predict the severity of different aspects simultaneously. For a fair and reliable comparison, we conduct the benchmarking under the same setting, using the dialogue script as the only available information, in the same 10-fold cross-validation configuration as (Martinez et al., 2020). Table 2 shows that our proposed method outperforms previous works by a large margin.
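A minimal sketch of such a 10-fold cross-validation benchmark with scikit-learn's StratifiedKFold is given below; train_and_predict is our own placeholder for training whichever model is being benchmarked.

```python
# Minimal sketch of a stratified 10-fold cross-validation benchmark.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def cross_validate(texts, labels, train_and_predict, n_splits=10, seed=42):
    texts, labels = np.asarray(texts, dtype=object), np.asarray(labels)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(texts, labels):
        preds = train_and_predict(texts[train_idx], labels[train_idx], texts[test_idx])
        scores.append(f1_score(labels[test_idx], preds, average="macro"))
    return float(np.mean(scores))   # average macro F1 over the 10 folds
```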
To conclude, our proposed method works reliably better on Frightening & Intense Scenes and Violence & Gore due to the Transformer's strong ability to capture latent implications behind utterances, while the word-level model, TextRCNN, performs better at detecting overt signals in Profanity. For Sex & Nudity and Alcohol, Drugs & Smoking, our method still achieves state-of-the-art results, although modeling the severity of these aspects remains challenging.
5 Discussion and Analysis
We investigate several popular movies, each with at least 200,000 IMDB ratings, from the test set of each aspect. We collect the top 5 movies with the largest number of severity ratings at each severity level as comparators, then perform pairwise ranking between each test movie and all of its comparators. Comparison results are shown in Table 3.
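The sketch below illustrates how this comparator-based analysis could be run, assuming the Siamese ranking-classification sketch given earlier in Section 2.1; the function and variable names are hypothetical.

```python
# Hypothetical sketch: rank a test movie against the comparators of each
# severity level and read off the pattern of less/equal/more decisions.
import torch

LEVELS = ["None", "Mild", "Moderate", "Severe"]

def compare_to_comparators(model, test_movie, comparators_by_level):
    """comparators_by_level maps a level name to a list of encoded comparator movies."""
    results = {}
    model.eval()
    with torch.no_grad():
        for level in LEVELS:
            decisions = []
            for comparator in comparators_by_level[level]:
                _, _, r_ab = model(test_movie, comparator)
                decisions.append(["less", "equal", "more"][r_ab.argmax(dim=-1).item()])
            results[level] = decisions   # e.g. {"Mild": ["equal", "equal", "less", ...]}
    return results
```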
Successful examples:
For the movie Deadpool
(2016), the model gives a correct prediction on the
Profanity aspect with a convincing signal from pair-
wise ranking. This movie is determined to have a higher severity level in Profanity than any candidate comparator from the None/Mild/Moderate levels. Among the Severe-level comparators, the test instance is ranked lower than one, equal to three, and higher than one. The movie Pride & Prejudice (2005) also receives a correct prediction on Violence. Four comparators from the None level show equal severity while only one indicates higher severity. Three Mild comparators give equal results while two indicate the test movie has lower severity. All of the Moderate and Severe comparators indicate the test movie has lower severity. We can therefore reason that Pride & Prejudice (2005) has a severity level of None in terms of violent content. A similar case is presented by The Greatest Showman (2017): it tends to appear higher than None but roughly at the Mild level among the comparators, and the model correctly predicts Mild. Therefore, by investigating the pairwise ranking results, the interpretation of the prediction becomes more vivid and easier for users to make sense of in terms of relative severity.
Unsuccessful examples:
For the movie Django
Unchained (2012) on Nudity and the movie A
Clockwork Orange (1971) on Alcohol, Drugs &
Smoking, the ranking gives a very clear pattern: higher than None, equal to Mild, and lower than Moderate. However, the actual severity level is Moderate instead of Mild. We conservatively argue that aspects such as Nudity and Alcohol, Drugs & Smoking can hardly be inferred from the dialogue text alone: it is unsurprising for adult or drinking scenes to appear in a movie without any verbal indication.
6 Conclusion & Future Work
In this work, we applied a deep learning-based method to predict the severity of age-restricted content from movie script data. The experimental results show that the proposed multitask ranking-classification model outperforms the previous state-of-the-art method and provides rich interpretability by demonstrating severity through example comparator movies. Our work provides a solid initial exploration of this research topic for the community. For future work, we propose to
investigate other modalities to capture relevant pat-
terns and fine-grained aspects like violence types.
Acknowledgements
This work was partially supported by the National
Science Foundation under grant # 2036368. We
would like to thank the anonymous EMNLP re-
viewers for their feedback on this work.
References
Craig A Anderson, Brad J Bushman, Bruce D
Bartholow, Joanne Cantor, Dimitri Christakis,
Sarah M Coyne, Edward Donnerstein, Jeanne Funk
Brockmyer, Douglas A Gentile, C Shawn Green,
et al. 2017. Screen violence and youth behavior. Pe-
diatrics, 140(Supplement 2):S142–S147.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
Kristina Toutanova. 2019. BERT: Pre-training of
deep bidirectional transformers for language under-
standing. In Proceedings of the 2019 Conference
of the North American Chapter of the Association
for Computational Linguistics: Human Language
Technologies, Volume 1 (Long and Short Papers),
pages 4171–4186, Minneapolis, Minnesota. Associ-
ation for Computational Linguistics.
Jeremy Howard and Sebastian Ruder. 2018. Universal
language model fine-tuning for text classification. In
Proceedings of the 56th Annual Meeting of the As-
sociation for Computational Linguistics (Volume 1:
Long Papers), pages 328–339, Melbourne, Australia.
Association for Computational Linguistics.
Eyke Hüllermeier, Johannes Fürnkranz, Weiwei Cheng,
and Klaus Brinker. 2008. Label ranking by learn-
ing pairwise preferences. Artificial Intelligence,
172(16):1897–1916.
Yoon Kim. 2014. Convolutional neural networks
for sentence classification. In Proceedings of the
2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 1746–1751,
Doha, Qatar. Association for Computational Lin-
guistics.
Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015.
Recurrent convolutional neural networks for text
classification. Proceedings of the AAAI Conference
on Artificial Intelligence, 29(1).
Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016.
Recurrent neural network for text classification with
multi-task learning. In Proceedings of the Twenty-
Fifth International Joint Conference on Artificial In-
telligence, IJCAI’16, page 2873–2879. AAAI Press.
Victor Martinez, Krishna Somandepalli, Yalda
Tehranian-Uhls, and Shrikanth Narayanan. 2020.
Joint estimation and analysis of risk behavior
ratings in movie scripts. In Proceedings of the
2020 Conference on Empirical Methods in Natural
Language Processing (EMNLP), pages 4780–4790,
Online. Association for Computational Linguistics.
Victor R. Martinez, Krishna Somandepalli, Karan
Singla, Anil Ramakrishna, Yalda T. Uhls, and
Shrikanth Narayanan. 2019. Violence rating predic-
tion from movie scripts. Proceedings of the AAAI
Conference on Artificial Intelligence, 33(01):671–
678.
Jeffrey Pennington, Richard Socher, and Christopher
Manning. 2014. GloVe: Global vectors for word
representation. In Proceedings of the 2014 Confer-
ence on Empirical Methods in Natural Language
Processing (EMNLP), pages 1532–1543, Doha,
Qatar. Association for Computational Linguistics.
Nils Reimers and Iryna Gurevych. 2019. Sentence-
BERT: Sentence embeddings using Siamese BERT-
networks. In Proceedings of the 2019 Conference on
Empirical Methods in Natural Language Processing
and the 9th International Joint Conference on Natu-
ral Language Processing (EMNLP-IJCNLP), pages
3982–3992, Hong Kong, China. Association for
Computational Linguistics.
Mahsa Shafaei, Adrian Pastor Lopez-Monroy, and
Thamar Solorio. 2019. Exploiting textual, visual
and product features for predicting the likeability of
movies. The 32nd International FLAIRS Confer-
ence.
Mahsa Shafaei, Niloofar Safi Samghabadi, Sudipta Kar,
and Thamar Solorio. 2020. Age suitability rating:
Predicting the MPAA rating based on movie dia-
logues. In Proceedings of the 12th Language Re-
sources and Evaluation Conference, pages 1327–
1335, Marseille, France. European Language Re-
sources Association.
Honglun Zhang, Liqiang Xiao, Yongkun Wang, and
Yaohui Jin. 2017. A generalized recurrent neural
architecture for text classification with multi-task
learning. In Proceedings of the Twenty-Sixth Inter-
national Joint Conference on Artificial Intelligence,
IJCAI-17, pages 3385–3391.
A Appendix
A.1 Document Length Distribution
The document length distribution of each aspect
is shown in Figure 2. Different aspects share a
similar document length distribution. The average
length and the median length of the movie scripts
are around 10,000 words.
A.2 Label Distribution
The severity label distribution for each aspect is
unbalanced to a greater or lesser extent, as shown
in Figure 3.
A.3 Data Separation
The training, development, and test separation of
each age-restricted aspect is shown in Table 4.
Aspect   Sex    Violence  Profanity  Substance  Frightening
Train    5200   3921      3910       3538       3553
Dev       651    491       489        443        445
Test      651    491       489        443        445

Table 4: The number of instances (train/dev/test) for each aspect.
A.4 Computing Infrastructure & Settings of Experiments
We implement the neural network models using PyTorch 1.6.0 and PyTorch Lightning 1.0.2. The experiments are executed on an NVIDIA Tesla P40 GPU.
For model development, we use 300-dimensional GloVe embeddings trained on Wikipedia 2014 + Gigaword 5 (6B tokens) for the TextCNN and TextRCNN models. The three 2D convolution modules of TextCNN have kernel sizes of 3, 4, and 5, respectively; each has 1 input channel and 10 output channels. The BERT baseline uses the bert-base-uncased model provided by the Python library Transformers. The proposed model uses 768-dimensional sentence embeddings from SentenceTransformer based on BERT-large. The Bi-LSTM used in the proposed model and in TextRCNN has a hidden size of 200 per direction. All neural network experiments use the Adam optimizer with a learning rate of 0.001.
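For convenience, the settings reported above can be collected into a single configuration dictionary; this is an illustrative summary only, not code released with the paper.

```python
# Illustrative summary of the reported experimental settings as a config dict.
CONFIG = {
    "word_embeddings": {"type": "GloVe", "dim": 300,
                        "corpus": "Wikipedia 2014 + Gigaword 5 (6B tokens)"},
    "textcnn": {"kernel_sizes": [3, 4, 5], "in_channels": 1, "out_channels": 10},
    "bert_baseline": "bert-base-uncased",
    "sentence_encoder": {"library": "sentence-transformers", "base": "BERT-large", "dim": 768},
    "bilstm": {"hidden_size_per_direction": 200},
    "optimizer": {"name": "Adam", "lr": 1e-3},
    "frameworks": {"pytorch": "1.6.0", "pytorch_lightning": "1.0.2"},
    "hardware": "NVIDIA Tesla P40",
}
```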
Figure 2: Document length distribution for each aspect. The plots use a logarithmic scale on the x-axis (document
length).
Figure 3: Label distribution for each aspect. 0-None, 1-Mild, 2-Moderate, 3-Severe.