The State of the State of the Union: A Data-Driven Analysis

Enzo Bergamo
May 3, 2021


The State of the Union address represents a pivotal moment for any U.S. president. As required by Article II, Section 3, Clause 1 of the U.S. Constitution, these speeches are traditionally given by presidents at the beginning of the calendar year to a joint session of Congress and represent an opportunity for the leader of the Executive branch to describe their vision of where America stands and where it is headed. As such, they have often received extensive coverage from the media.

Given the huge attention to this event, one question remains: how much does the State of the Union really say about the state of the union? This prompted me to further investigate this traditional and ceremonial event; the results of this analysis are available below.

Obtaining the data

While additional data will be used extensively (and credited where necessary), the main source naturally consists of the transcripts of the original State of the Union speeches. Interestingly enough, there seems to be no central, easily accessible source for all of these transcripts, so turning to web scraping was necessary. Thankfully, the data was almost entirely available through the Infoplease website. After some slight data wrangling, the final result was a collection of every State of the Union address from George Washington's first in 1790 to George W. Bush's last in 2008. Additionally, using data from The Guardian, it was possible to add the party of each president, an extremely valuable piece of information. At this point, it was possible to explore the data further.
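As a rough illustration of the scraping step, the sketch below pulls the paragraph text out of a transcript page using only the Python standard library. It is not the exact code used for this article: the assumption that the speech text sits inside `<p>` tags, and the names `ParagraphExtractor`, `extract_speech_text`, and `fetch_html`, are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class ParagraphExtractor(HTMLParser):
    """Collect the visible text of every <p> element on a page."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._chunks = []       # text pieces of the paragraph being read
        self.paragraphs = []    # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            # Collapse runs of whitespace left over from the HTML layout.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_paragraph:
            self._chunks.append(data)


def extract_speech_text(html):
    """Return the page's paragraphs joined by blank lines."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.paragraphs)


def fetch_html(url):
    """Download a page; `url` is whichever transcript page is being scraped."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

In practice, `extract_speech_text(fetch_html(url))` would be called once per transcript page, and real pages usually need extra cleanup (navigation text, footers) that this sketch ignores.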

Sentiment Analysis: the Two Sides of the Aisle & Presidents

A natural first step given the textual nature of the dataset is to use Natural Language Processing techniques (more specifically, sentiment analysis algorithms) to extract more quantitative data points. Plotting the average sentiment score per speech gives the following.

Average sentiment score for each speech from 1900 onward. Note that throughout this article (unless otherwise noted) data from before 1900 was removed.
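To make "average sentiment per speech" concrete, here is a minimal sketch of one way such a score could be computed: each sentence is scored against a word list and the sentence scores are averaged. The tiny `POSITIVE` and `NEGATIVE` lexicons are made up for illustration, and a real analysis would more likely rely on an established sentiment library; this is not necessarily the scorer behind the plot above.

```python
import re

# Toy lexicons, invented for this example only.
POSITIVE = {"peace", "prosperity", "hope", "progress", "strong", "freedom"}
NEGATIVE = {"war", "crisis", "fear", "threat", "debt", "enemy"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


def sentence_score(sentence):
    """(positive words - negative words) / total words, or 0.0 if empty."""
    words = tokenize(sentence)
    if not words:
        return 0.0
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return (positive - negative) / len(words)


def speech_sentiment(text):
    """Average the sentence scores over the whole speech."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    return sum(sentence_score(s) for s in sentences) / len(sentences)
```

Averaging per sentence rather than per word keeps one very long sentence from dominating the score, which is a design choice rather than a requirement.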

While no clear patterns can be observed, one interesting result can be directly calculated from here: the average sentiment for each of the two main contemporary parties. As can be seen, there are slight differences between the two parties. Namely, Democratic presidents seem to be more positive on average than their Republican counterparts; additionally, Republicans were the only ones to give speeches with a negative mean sentiment. Note that removing outliers amplifies these results.
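The party comparison boils down to a grouped mean, optionally after dropping outliers. The sketch below shows one way to do this; the 1.5 × IQR rule is an assumption (the exact outlier criterion is not stated here), and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, quantiles


def drop_outliers(values, k=1.5):
    """Keep values within k * IQR of the quartiles (assumed outlier rule)."""
    if len(values) < 4:
        return list(values)  # too few points to estimate quartiles sensibly
    q1, _, q3 = quantiles(values, n=4)
    spread = k * (q3 - q1)
    return [v for v in values if q1 - spread <= v <= q3 + spread]


def mean_by_party(records, remove_outliers=False):
    """records: iterable of (party, sentiment_score) pairs, one per speech."""
    groups = defaultdict(list)
    for party, score in records:
        groups[party].append(score)
    result = {}
    for party, scores in groups.items():
        if remove_outliers:
            scores = drop_outliers(scores)
        result[party] = mean(scores)
    return result
```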

This difference between the two groups suggests an additional path to explore. Since the sentiment statistics for the two parties are somewhat disparate, it is worth understanding how else the two parties diverge in their annual speeches. To this end, we can explore the main topics used by presidents from each of the two parties through word clouds.

Democrat and Republican topic word clouds, respectively.

Through inspection, some notable observations can be made: Democrats tend to use words such as "health", "tax", and "communities" more than Republicans; words such as "budget", "family", and "economy" were more prevalent on the Republican side. This proves to be quite interesting, as these words provide an accurate glimpse into the two parties' platforms. In a similar fashion, we can produce word clouds for individual presidents. As seen below, they very much reflect the political environment and platform of each individual.

Word clouds for Presidents George W. Bush, Bill Clinton, and Ronald Reagan.
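A word cloud is essentially a picture of word frequencies, so the underlying computation can be sketched as a simple count per group, where the group is either a party or a president. The stopword list below is a small made-up sample, and the function names are hypothetical; the rendering step itself is left to whatever plotting tool is preferred.

```python
import re
from collections import Counter

# A deliberately short stopword list, for illustration only.
STOPWORDS = {
    "the", "and", "of", "to", "a", "in", "for", "we", "our", "is", "that",
    "on", "with", "this", "will", "be", "it", "as", "are", "have", "by",
}


def top_words(speeches, n=10):
    """Return the n most frequent non-stopwords across a list of speeches."""
    counts = Counter()
    for text in speeches:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(n)


def top_words_by_group(records, n=10):
    """records: iterable of (group, speech_text); group is a party or a president."""
    grouped = {}
    for group, text in records:
        grouped.setdefault(group, []).append(text)
    return {group: top_words(texts, n) for group, texts in grouped.items()}
```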

Additionally, a last simple analysis looks at the approximate duration of these speeches by party, and the result is as follows.

The State of the Union: Coup D’Oeil at Reality

As mentioned in the introduction, the main objective of this article was to investigate the connection between the State of the Union address and the actual circumstances in day-to-day America. In order to measure this connection, different proxies were compared to the count of related words; for example, in the following analysis, words like "economy", "wealth", and "GDP" were considered when comparing the speeches to the GDP.
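Counting topic-related words per speech can be sketched as follows. The word set uses the three examples mentioned above; the full list used in the analysis is not given here, and the function names are hypothetical. Matching is exact, so a word like "economic" would not be counted unless it is added to the set.

```python
import re

# Only the three example words named in the text; a real list would be longer.
ECONOMY_WORDS = {"economy", "wealth", "gdp"}


def count_topic_words(text, topic_words):
    """Count how many words in the speech belong to the topic word set."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(1 for word in words if word in topic_words)


def topic_counts_by_year(speeches_by_year, topic_words):
    """Map each year to its topic word count, ordered by year."""
    return {
        year: count_topic_words(text, topic_words)
        for year, text in sorted(speeches_by_year.items())
    }
```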

For this analysis, data from Macrotrends was used. There is a clear relationship between the two: sharp decreases can be seen in both metrics in years such as 1974 and 1990, as well as in more recent years, while sharp increases can be seen in 1976 and 1981. More directly, we can plot one metric against the other and observe a clear positive correlation between the two variables. While somewhat expected, it is quite interesting to see how immediate an effect the economy has on the content of these speeches, and vice versa.
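The strength of such a relationship is commonly summarized with the Pearson correlation coefficient. The sketch below computes it directly from its definition; whether this exact statistic was used for the plot is an assumption, and `pearson` is a hypothetical helper name.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(var_x * var_y)
```

Here `xs` would be the yearly count of economy-related words and `ys` the matching yearly economic indicator, aligned on the same years.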

The same procedure can be applied to a number of other indicators. A particularly interesting case is the unemployment level, using data from the St. Louis Fed. Here, we can see an almost perfect correlation with a time shift of one year.
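A time-shifted comparison can be sketched by pairing each speech with the indicator value from `lag` years earlier before computing the correlation. The direction of the shift is an assumption here (the speech in year t is matched with unemployment in year t - 1), as are the function names.

```python
from math import sqrt


def align_with_lag(speech_by_year, indicator_by_year, lag=1):
    """Pair the speech value of year t with the indicator of year t - lag."""
    xs, ys = [], []
    for year in sorted(speech_by_year):
        if year - lag in indicator_by_year:
            xs.append(speech_by_year[year])
            ys.append(indicator_by_year[year - lag])
    return xs, ys


def lagged_correlation(speech_by_year, indicator_by_year, lag=1):
    """Pearson correlation of the two series after applying the lag."""
    xs, ys = align_with_lag(speech_by_year, indicator_by_year, lag)
    n = len(xs)
    if n < 2:
        raise ValueError("not enough overlapping years for this lag")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / (spread_x * spread_y)
```

Trying a few values of `lag` (for example 0, 1, and 2) and comparing the resulting coefficients is one simple way to see which shift fits best.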

Beyond the Numbers

The process above becomes a bit repetitive; nonetheless, it is surprising how well correlated these simple metrics are. They do not, however, tell the whole story. Therefore, this section uses additional NLP techniques to further investigate this connection during major events in American society, such as international conflicts.

First, it is interesting to analyze how words related to the Civil Rights Movement appear in State of the Union speeches. As done previously, we start by analyzing the mean number of related words in those speeches. Notably, the average varied greatly depending on the party of the president: Democratic presidents mentioned those words an average of 9 times per speech, while Republicans mentioned them 4 times.

One can note that during the 1960s — the peak of the Civil Rights Movement — there was a comparatively low number of mentions of such topics. An exception is the year 1969 when then-President Lyndon B. Johnson gave his last State of the Union speech, which openly supported the Civil Rights Movement. This can be further visualized through an emotional valence plot of the speech where the sentiment of each sentence is plotted against time.
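An emotional valence plot can be sketched as a list of per-sentence sentiment scores placed along the speech from start (0.0) to end (1.0). The version below also applies a moving average to make the trend easier to read; the smoothing step and the name `valence_curve` are assumptions, and the sentence scores can come from any sentiment model.

```python
def valence_curve(sentence_scores, window=5):
    """Return (position, smoothed_score) pairs for a speech.

    position runs from 0.0 (first sentence) to 1.0 (last sentence);
    each score is averaged with up to window // 2 neighbours on each side.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    n = len(sentence_scores)
    half = window // 2
    curve = []
    for i in range(n):
        # Near the edges the window is simply truncated.
        start, stop = max(0, i - half), min(n, i + half + 1)
        smoothed = sum(sentence_scores[start:stop]) / (stop - start)
        position = i / (n - 1) if n > 1 else 0.0
        curve.append((position, smoothed))
    return curve
```

Plotting the pairs as a line chart, with position on the horizontal axis, gives the kind of figure described above.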

While the then rather controversial topic of the Civil Rights Movement did not play a big role in the State of the Union Speeches, the same cannot be said regarding military conflicts involving the United States. The plot below highlights periods in U.S. history with major military activity and the corresponding number of words associated with it during the State of the Union speeches.

When looking at the emotional valence plot for 1944 during the peak of World War II, we see a stark difference when compared to President Johnson's speech. Additionally, a post-war emotional valence plot is also included.

Emotional valence plots for the State of the Union speeches of 1944 and 1946. It is easy to observe that the 1944 speech contains noticeably more negative sentences than its post-war counterpart.

To Form a More Perfect State of the Union

Throughout this article, it was shown how certain metrics and topics, such as GDP, unemployment, and war, were clearly reflected in the State of the Union addresses. However, more contentious topics, the Civil Rights Movement for example, were often left behind.

As mentioned at the beginning of this article, the State of the Union is one of the most emblematic ceremonies in the world's oldest continuous democracy. While most definitely imperfect, the State of the Union tells a story about the United States — its past, present, and future. With this article, I hope that it was possible to take a peek at all the amazing history hidden in it.