Data and Hacking

Idris Raja's blog

Mad Men: Which Characters are Being Talked About in Season 5 So Far?


Mad Men is one of my favorite TV shows. Each episode generates a lot of internet analysis, and my go-to blog for Mad Men is Tom+Lorenzo, aka “TLo”. They have insightful posts on each episode and, better yet, a great crew of readers who leave hundreds of comments.

Mad Men has an ensemble cast where a character will figure prominently in one episode and not even show up in the next. One possible way to show how characters come and go is to count how often each character’s name shows up in the comments. So I quickly scraped TLo’s blog and created the chart above. See the bottom half of this post for the code and methodology.

I think the chart above tells a bit of a story about this season so far. Betty has appeared in only one episode, the second, which birthed the internet meme “Fat Betty.” Her mentions peaked with that episode, and otherwise she has had a steady number of background mentions.

Episode 4 is when Pete had his fight, and sure enough he got his spike in the comments; otherwise he has been fairly quiet.

Don dominates the comments as he dominates the show as he dominates everyone around him. Roger is a steady presence, Joan is a major minor character, and Peggy had her high moment of the season so far in the last episode when she and her Catholic mother got into it.

For some reason TLo shut off the comments for previous seasons, but hopefully I can get that data or grab similar data from somewhere else. It would be fun to see this kind of graph for all seasons to see how characters come and go.

Scrape, Parse, Visualize

Scrape

I first create a list of the episode URLs and save it in a file called episodes, like this:

http://www.tomandlorenzo.com/2012/03/mad-men-a-little-kiss.html
http://www.tomandlorenzo.com/2012/04/mad-men-tea-leaves.html
http://www.tomandlorenzo.com/2012/04/mad-men-mystery-date.html
http://www.tomandlorenzo.com/2012/04/mad-men-signal-30.html
http://www.tomandlorenzo.com/2012/04/mad-men-far-away-places.html
http://www.tomandlorenzo.com/2012/04/mad-men-at-the-codfish-ball.html

Then I can grab the web pages with a simple wget command:

wget -i episodes
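
wget saves each page under the last path component of its URL, which is where the episode file names used in the later steps come from. The sketch below previews those names without downloading anything. The helper name expected_files is made up for illustration and is not part of the original workflow.

```shell
# Hypothetical helper: print the local file name wget would use for each URL
# in a list file (wget names a download after the last path component).
expected_files() {
    while IFS= read -r url; do
        # skip blank lines in the list
        if [ -n "$url" ]; then
            basename "$url"
        fi
    done < "$1"
}

# Usage sketch, assuming the list file is called `episodes` as above:
#   expected_files episodes
#   wget --wait=1 -i episodes    # optional: pause one second between requests
```

For the first URL in the list this prints mad-men-a-little-kiss.html, the same name that later shows up in the third column of the counts file.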

Parse

I save a file called characters which contains the major character names that we will count in each episode recap and related comments.

don
megan
peggy
roger
joan
sally
betty
pete

This shell script loops through each HTML file from the scrape step, tokenizes it, and counts the occurrences of each character name. Lastly, it saves the output in a format that is easy to read into R or any other tool that can deal with flat files.

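base_dir='/home/id/mad_men'
rm -rf "$base_dir/tmp"
mkdir "$base_dir/tmp"
cd "$base_dir/tmp" || exit 1

# ls -tr lists oldest files first, so episodes stay in download order
for i in $(ls -tr "$base_dir"/data/*.html); do
    base=$(basename "$i")
    # lowercase, split into one token per line, count tokens,
    # then keep only the tokens listed in the characters file
    tr 'A-Z' 'a-z' < "$i" | tr -sc '[:alnum:]' '\n' | sort | uniq -c | sort -nr |
        grep -w -f "$base_dir/data/characters" > tmp
    # append the file name as a third column
    sed "s/$/ $base/" tmp >> counts.txt
done

# strip the leading spaces that uniq -c adds
sed 's/^[ ]*//' counts.txt > final.txt
mv final.txt ../
rm -rf "$base_dir/tmp"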

The shell script outputs a file in this format, where the first column is the count, the second is the character name, and the third is the episode's file name.

309 don mad-men-a-little-kiss.html
220 megan mad-men-a-little-kiss.html
158 joan mad-men-a-little-kiss.html
113 peggy mad-men-a-little-kiss.html
107 roger mad-men-a-little-kiss.html
81 pete mad-men-a-little-kiss.html
69 betty mad-men-a-little-kiss.html
15 sally mad-men-a-little-kiss.html
210 betty mad-men-tea-leaves.html
191 don mad-men-tea-leaves.html
77 peggy mad-men-tea-leaves.html
73 megan mad-men-tea-leaves.html
50 sally mad-men-tea-leaves.html
45 roger mad-men-tea-leaves.html
17 pete mad-men-tea-leaves.html
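
To see how the pipeline inside the loop turns raw text into rows like these, here is a minimal sketch that runs the same tr, sort, uniq, and grep steps on one made-up sentence. The function name count_names and the sample sentence are invented for illustration.

```shell
# Hypothetical helper: count character names in text read from stdin.
# $1 is a file with one lowercase name per line (like the characters file).
count_names() {
    tr 'A-Z' 'a-z' |
        tr -sc '[:alnum:]' '\n' |
        sort | uniq -c | sort -nr |
        grep -w -f "$1" |
        sed 's/^ *//'
}

# A tiny stand-in for the characters file
names=$(mktemp)
printf '%s\n' don pete betty > "$names"

printf '%s\n' "Don and Pete argued. Don't tell Betty what Don said to Pete." |
    count_names "$names"
# prints:
# 3 don
# 2 pete
# 1 betty
rm -f "$names"
```

Note that splitting on non-alphanumeric characters turns “Don't” into the tokens don and t, so this kind of tokenizing counts it as a mention of Don. That is worth keeping in mind when reading Don's totals.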

Visualize

Now that file can be loaded into R. After a bit of data wrangling and wrestling, we get the picture at the top of this blog post. I’ve included the R code below, which is a quick and dirty hack to get the data visualized.

library(ggplot2)
library(plyr)
library(RColorBrewer)
library(scales)

# capitalize the first letter of a string
titlecase <- function(s) {
    substr(s, 1, 1) <- toupper(substr(s, 1, 1))
    return(s)
}

# turn 'mad-men-a-little-kiss.html' into 'A Little Kiss'
name_change <- function(x) {
    x <- sub('mad-men-', '', x, fixed=TRUE)
    x <- sub('.html', '', x, fixed=TRUE)  # fixed=TRUE: '.' is not a wildcard
    x <- gsub('-', ' ', x, fixed=TRUE)
    words <- unlist(strsplit(x, ' '))
    caps <- sapply(words, titlecase)
    return(paste(caps, collapse=' '))
}

# load data; keep strings as factors (no longer the default since R 4.0)
setwd('~/mad_men')
dat <- read.csv('final.txt', col.names=c('count', 'name', 'episode'), sep=' ',
                header=FALSE, stringsAsFactors=TRUE)

# transform data: pretty episode titles, kept in file order, and capitalized names
dat$episode <- sapply(as.character(dat$episode), name_change, USE.NAMES=FALSE)
dat$episode <- ordered(dat$episode, unique(dat$episode))
levels(dat$name) <- as.vector(sapply(levels(dat$name), name_change))

# transform data for label positions:
# each label sits at the midpoint of its band in the latest episode
dat <- ddply(dat, .(episode), transform, percent = count / sum(count))
latest_episode <- levels(dat$episode)[length(levels(dat$episode))]
dd <- subset(dat, episode == latest_episode, drop=TRUE)
dd <- dd[order(dd$name), ]
dd$cum <- cumsum(dd$percent)
dd$cum_shift <- c(0, dd$cum[-length(dd$cum)])
dd$label_point <- (dd$cum + dd$cum_shift) / 2

# create plot
g <- ggplot(arrange(dat, name, episode), aes(x = episode, y = percent))
# ggplot2 >= 2.2 stacks the first level on top by default; reverse the stack
# so it matches the bottom-up label positions computed above
g <- g + geom_area(aes(fill = name, group = name),
                   position = position_stack(reverse = TRUE))
g <- g + scale_fill_brewer()
g <- g + geom_text(data = dd, aes(x = length(levels(dd$episode)), y = label_point,
                                  label = name), hjust = 1)
# opts() was removed from ggplot2; theme() and ggtitle() replace it
g <- g + theme(legend.position = 'none')
g <- g + ggtitle('Mad Men: Which Characters Are Most Talked About in Season 5?')
g <- g + xlab('Episode') + ylab('')
g <- g + scale_y_continuous(labels = scales::percent)
# saving as SVG needs the svglite package in recent ggplot2 versions
ggsave(filename='mad_men_s5_e6.svg', plot=g, width=9, height=9)
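
The percent column in the R code is each character's count divided by the total count for that episode. As a cross-check outside R, the same normalization can be sketched in the shell with awk. The function name episode_shares is made up for illustration, and the demo rows are the eight “A Little Kiss” rows from the sample output earlier in the post.

```shell
# Hypothetical helper: print "episode name share" for every row of a
# "count name episode" file (the final.txt layout shown earlier).
episode_shares() {
    # The file is read twice: the first pass sums the counts per episode,
    # the second pass divides each count by its episode's total.
    LC_ALL=C awk '
        NR == FNR { total[$3] += $1; next }
        { printf "%s %s %.3f\n", $3, $2, $1 / total[$3] }
    ' "$1" "$1"
}

# Demo on the eight rows for the first episode
sample=$(mktemp)
cat > "$sample" <<'EOF'
309 don mad-men-a-little-kiss.html
220 megan mad-men-a-little-kiss.html
158 joan mad-men-a-little-kiss.html
113 peggy mad-men-a-little-kiss.html
107 roger mad-men-a-little-kiss.html
81 pete mad-men-a-little-kiss.html
69 betty mad-men-a-little-kiss.html
15 sally mad-men-a-little-kiss.html
EOF
episode_shares "$sample" | head -n 1
# prints: mad-men-a-little-kiss.html don 0.288
rm -f "$sample"
```

The eight counts sum to 1072, so Don's 309 mentions come to roughly 29% of all character mentions for that episode, which is the height of his band at the left edge of the chart.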
