To help keep pace with all of the election coverage on the 24/7 news channels, Jed has been archiving closed captioning data. However, the sheer volume of text makes it hard to analyze.
In this diary, I'll show several automatically generated images (word clouds) that highlight how prevalent certain words are in election coverage on three of these channels; the analysis covers the last week's worth of closed captioning feeds. Let's look at the top 20 words on CNN, MSNBC, and Fox News (from left to right):
A recent study from George Mason University found that in major network news coverage "28% of the statements were positive for Obama and 72% negative" whereas for McCain "43% of the statements positive and 57% negative". (The fact that coverage for both of them is more than 50% negative isn't that surprising: it's easy for pundits to tear into candidates.) It's hard to determine using a purely automated analysis whether references to Obama or McCain were positive or negative, but if we combine the GMU study with this one, it becomes clear that Obama gets the short end of the media stick: not only is Obama talked about far more in the media, but the majority of that coverage is negative.
Keeping that in mind, note how MSNBC has the most "balance" (relatively speaking) between "Obama" and "McCain" and Fox News has the least balance. In addition, it's possible that Fox News also references Obama by his full name (or first name) more often than the other networks, which causes "Barack" to be the third most common word.
Next, let's look at our favorite feuding TV hosts, Keith Olbermann and Bill O'Reilly, and let's throw in Wolf Blitzer as well (left to right):
The image is a bit damning for Bill O'Reilly: his coverage is far more narrowly focused on Barack Obama than Keith Olbermann's is on McCain. Wolf Blitzer ends up around the media's average, which means he's skewed towards more (negative) coverage for Obama.
Finally, let's look at Anderson Cooper, Chris Matthews, and Sean Hannity:
Interestingly, it seems that both Anderson Cooper and Chris Matthews still like talking about Bill and/or Hillary Clinton fairly often in their election coverage, whereas Hannity doesn't. (However, Fox News coverage does tend to include the Clintons a fair bit, as the overall image shows.)
If you'd like to try processing closed captioning text on your own, here's how I did it. First, you need to extract and reformat the closed caption text, which is output from the Windows closed caption extraction tool:
cut -f 2 | dos2unix | sed 's/>> /\n/g' | fmt
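If you'd rather do this step in Python, here's a loose sketch of the same idea (not a byte-for-byte equivalent of the pipeline). It assumes the extraction tool writes tab-separated lines with the caption text in the second column, and that ">> " marks a change of speaker. The function name, the timestamp column, and the sample lines are made up for illustration.

```python
import textwrap

def reformat_captions(lines, width=75):
    """Pull caption text out of tab-separated lines and wrap it by speaker turn."""
    texts = []
    for line in lines:
        # rstrip also drops DOS (CRLF) line endings
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= 2:  # assumption: caption text is in column 2
            texts.append(fields[1].strip())
    # assumption: ">> " marks a new speaker, so start a new block at each one
    turns = " ".join(texts).split(">> ")
    return [textwrap.fill(t.strip(), width=width) for t in turns if t.strip()]

# Hypothetical sample input; the first column here is an invented timestamp.
sample = [
    "00:01\t>> GOOD EVENING AND WELCOME\r\n",
    "00:02\tTO THE PROGRAM.\r\n",
    "00:03\t>> THANKS FOR HAVING ME.\r\n",
]
for block in reformat_captions(sample):
    print(block)
```

Lines without a tab are skipped here, which is a simplification of what cut does.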
Next (optionally) you can search for content that is near a mention of Obama or McCain, to home in on election-related coverage:
egrep -C 5 -i 'obama|mccain'
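The same filtering can be sketched in Python, which is handy if you want to change what counts as "near". This is only an approximation of grep's context option: it keeps every line within a few lines of a match, but it doesn't print the "--" separators grep puts between groups. The function name and defaults are my own choices.

```python
import re

def lines_near(lines, pattern=r"obama|mccain", context=5):
    """Return lines within `context` lines of a case-insensitive match, in order."""
    regex = re.compile(pattern, re.IGNORECASE)
    keep = set()
    for i, line in enumerate(lines):
        if regex.search(line):
            lo = max(0, i - context)
            hi = min(len(lines), i + context + 1)
            keep.update(range(lo, hi))  # overlapping windows merge naturally
    return [lines[i] for i in sorted(keep)]

# Hypothetical caption lines, with a narrow window so the effect is visible.
captions = ["WEATHER IS NEXT", "BUT FIRST", "OBAMA SPOKE TODAY", "IN OHIO", "SPORTS", "TRAFFIC"]
print(lines_near(captions, context=1))
```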
Next, you need to create a word frequency histogram, after filtering out unwanted garbage words (in a text file called garbage-words):
tr " " "\n" | tr -d ",.()[]\"" | ./toupper.py | sort | grep -v ":" | grep -v -i -x -F -f garbage-words | uniq -c | sort -n -r
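Here's a rough Python sketch of the histogram step, in case the pipeline is hard to follow. I'm assuming toupper.py does nothing more than uppercase its input (it isn't shown here), and that the garbage words are given one per line. The function name and sample text are invented for the example.

```python
import re
from collections import Counter

def word_histogram(text, garbage_words=()):
    """Count uppercased words, skipping garbage words and tokens containing ':'."""
    garbage = {w.upper() for w in garbage_words}
    counts = Counter()
    for token in text.split():
        # strip the same punctuation the tr -d step removes, then uppercase
        token = re.sub(r'[,.()\[\]"]', "", token).upper()
        if not token or ":" in token or token in garbage:
            continue
        counts[token] += 1
    return counts.most_common()  # most frequent word first

# Hypothetical input; prints "count word" lines like uniq -c | sort -n -r would.
for word, count in word_histogram("Obama said McCain and Obama agree", ["and", "said"]):
    print(count, word)
```

The output keeps the "count word" layout, most frequent word first, which is what the word cloud script expects on its input.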
Finally, you need to generate the word cloud images using ImageMagick:
#!/usr/bin/env python3
# Reads "count word" lines (most frequent word first) on stdin and
# stacks one ImageMagick label per word into cloud.png.
import os, sys

font = "Helvetica-Bold"
maxfontsize = 100
firstfontsize = 0
imagelist = ""
for line in sys.stdin.readlines():
    fields = line.strip().split()
    if len(fields) < 2:
        continue  # skip blank or malformed lines
    count = int(fields[0])
    # swap straight apostrophes for curly ones so they can't end the shell quoting below
    word = fields[1].replace("'", "’")
    print("word ", word)
    fontsize = int(pow(count, 1.1) / 25)
    if firstfontsize == 0:
        firstfontsize = fontsize
    # scale font size relative to maxfontsize
    fontsize = int(float(fontsize) * float(maxfontsize) / float(firstfontsize))
    cmd = ("convert -font " + font + " -pointsize " + str(fontsize) + " label:'"
           + word + "' " + "tmpimage." + word + ".png")
    os.system(cmd)
    imagelist = imagelist + "tmpimage." + word + ".png "
# stack the per-word images vertically into one picture
cmd = "convert -background white -gravity Center " + imagelist + "-append cloud.png"
os.system(cmd)
os.system("rm tmpimage*.*")
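To see what the font scaling does without running ImageMagick, here's a small sketch that pulls the size formula out into a function. It uses the same formula as the script, plus one guard of my own: if the first word's raw size rounds down to zero, it returns zeros instead of dividing by zero. The function name and the sample counts are invented.

```python
def scaled_font_sizes(counts, max_font_size=100):
    """Map word counts (most frequent first) to point sizes relative to the first word."""
    sizes = []
    first = 0
    for count in counts:
        size = int(pow(count, 1.1) / 25)  # raw size, same formula as the script
        if first == 0:
            first = size                  # the first word sets the scale
        # guard (my addition): avoid dividing by zero when counts are tiny
        sizes.append(int(float(size) * max_font_size / first) if first else 0)
    return sizes

# Hypothetical counts for the three most common words.
print(scaled_font_sizes([1000, 500, 250]))  # → [100, 46, 21]
```

Because every size is relative to the first word, the input has to arrive sorted with the most frequent word first, which is what the sort -n -r at the end of the histogram step takes care of. Very small counts collapse to a size of zero, so this works best on a week's worth of text rather than a single show.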
Are there other interesting studies that could be done with this data?