The Presidential Debates Seen Through a Prism of Art and Natural Language Processing
In this work, I apply computational linguistics to transcripts of recent debates involving Hillary Clinton and Donald Trump.
Politicians tend to repeat core concepts. Using natural language processing, I determined which phrases each candidate used most frequently, focusing on three-, five- and seven-word phrases.
I parsed over half a million words: 218,265 from Donald Trump and 308,319 from Hillary Clinton, using an open-source Python library, the Natural Language Toolkit (NLTK) (1).
I then feed the results into vector graphics and animation tools to represent the data in artistic form. Press “run” above to interact with the art.
If you're interested in the code and larger data sets, see below.
N-grams from the presidential debates: 3-, 5- and 7-word phrases.
Hillary Clinton
Phrase | Number of Times Used |
---|---|
we have to | 114 |
a lot of | 101 |
i want to | 82 |
to try to | 73 |
we need to | 66 |
i think that | 51 |
and i think | 47 |
i think its | 46 |
were going to | 45 |
we have a | 41 |
we have a lot of | 10 |
barriers that stand in the way | 9 |
do everything I can to | 9 |
that stand in the way | 9 |
at the end of the | 8 |
stand in the way of | 8 |
with a path to citizenship | 7 |
immigration reform with a path | 7 |
barriers that stand in the way of | 8 |
comprehensive immigration reform with a path to | 7 |
that stand in the way of people | 5 |
and i will do everything i can | 5 |
have a lot of work to do | 4 |
no bank is too big to fail | 4 |
to extend the social security trust fund | 4 |
and thats what i will do as | 4 |
we have a lot of work to | 4 |
up to his or her godgiven potential | 4 |
chance to live up to his or | 4 |
Donald Trump
Phrase | Number of Times Used |
---|---|
were going to | 85 |
we have to | 78 |
by the way | 78 |
a lot of | 65 |
going to be | 48 |
let me just | 41 |
you look at | 39 |
first of all | 39 |
you have to | 36 |
going to have | 33 |
let me just tell you | 25 |
and i will tell you | 9 |
if you look at the | 9 |
that i can tell you | 9 |
were going to have a | 7 |
going to bring jobs back | 7 |
going to be able to | 7 |
i will tell you this | 6 |
have to get rid of | 6 |
tens of thousands of people | 6 |
we have no idea who they are | 4 |
he beats the rest of the field | 4 |
i beat hillary clinton in many polls | 3 |
see what happens at the end of | 3 |
have a country or we dont have | 3 |
ive hired tens of thousands of people | 3 |
we should have gotten rid of the | 3 |
lets see what happens at the end | 3 |
youre going to destabilize the middle east | 3 |
im going to bring jobs back from | 3 |
The dataset
The University of California, Santa Barbara provides transcripts of the main primary debates. To build each candidate's corpus, I extracted that candidate's remarks from each transcript.
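The extraction step above can be sketched with the standard library alone. This is an illustrative sketch, not the script actually used: it assumes the common transcript convention where each turn begins with the speaker's surname in capitals (e.g. "CLINTON: ..."), and the function name `extract_speaker` is hypothetical.

```python
import re

def extract_speaker(transcript, speaker):
    """Collect all remarks by `speaker` from a speaker-labeled transcript."""
    turns = []
    current = None
    for line in transcript.splitlines():
        m = re.match(r"^([A-Z]+):\s*(.*)", line)
        if m:
            # A new labeled turn; remember whose it is.
            current = m.group(1)
            if current == speaker:
                turns.append(m.group(2))
        elif current == speaker and line.strip():
            # Unlabeled line continuing the current speaker's turn.
            turns.append(line.strip())
    return " ".join(turns)

transcript = """CLINTON: We have to build an economy.
TRUMP: We have to bring jobs back.
CLINTON: I want to invest in the middle class."""

print(extract_speaker(transcript, "CLINTON"))
```

Concatenating the turns for one speaker across all transcripts yields that candidate's corpus.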
Python has a great natural language library called the Natural Language Toolkit (NLTK). It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania (1).
```python
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.util import ngrams
from collections import Counter
import string

# Open and read the file, lowercasing all text.
text = open("debates/hill/all.txt", "r").read().lower()

# Remove all punctuation from the text.
text = "".join([ch for ch in text if ch not in string.punctuation])

# TweetTokenizer does not split contractions into two parts
# (e.g. "didn't" stays whole rather than becoming "did" and "n't").
tknzr = TweetTokenizer()
tokens = tknzr.tokenize(text)

# Build three-, five- and seven-word phrases.
trigrams = ngrams(tokens, 3)
fivegrams = ngrams(tokens, 5)
sevengrams = ngrams(tokens, 7)

# Output the phrase counts.
print(Counter(trigrams))
print(Counter(fivegrams))
print(Counter(sevengrams))
```

(1) Bird, Steven, Edward Loper and Ewan Klein (2009), Natural Language Processing with Python. O’Reilly Media Inc.
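For readers without NLTK installed, the counting step itself can be reproduced with the standard library alone. A minimal sketch (the helper `count_ngrams` is hypothetical, not part of the script above):

```python
from collections import Counter

def count_ngrams(text, n):
    """Count n-word phrases in lowercased, punctuation-free text."""
    tokens = text.lower().split()
    # Slide a window of n tokens across the list; zip stops at the end.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

sample = "we have to do it we have to try"
counts = count_ngrams(sample, 3)
print(counts.most_common(2))
```

Tables like the ones above are then just the `most_common` entries for n = 3, 5 and 7.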