Introduction

Time is an American weekly news magazine and news website published in New York City. It was founded in 1923 and originally run by Henry Luce.

Time has the world’s largest circulation for a weekly news magazine. The print edition has a readership of 26 million, 20 million of whom are based in the United States. In mid-2012, its circulation was over three million, which had fallen to two million by late 2017.

The uploaded dataset sheds light on how gender diversity has been maintained in the choice of cover pictures for the magazine, from its beginning in 1923 until 2013.

Does Time really abide by equality? In a world where media has infiltrated our lives to the greatest extent, is Time a responsible bearer of gender diversity? These are a few of the questions I try to address in this blog post.


Project Details

  • The Kaggle notebook for this project, which you can fork, is linked here.
  • The Time Cover data used here is another Kaggle dataset.
  • The GitHub repo for this can be accessed from here.
  • The Python libraries used extensively are:
    • pandas - For analysing the data.
    • Matplotlib - For plotting the stacked bar graph.
    • seaborn - For plotting the scatter graph.


Exploratory Analysis

A. Importing libraries

To begin this exploratory analysis, we first import the libraries we need: matplotlib for plotting, plus numpy and pandas for working with the data. We then show how the gender demography on the covers of Time has changed over a period of 90 years.

import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

B. Accessing the data

There is 1 csv file in the current version of the dataset:

print(os.listdir('../input'))
['TIMEGenderData.csv']

C. Reading the data

We read the data and store it in a dataframe using read_csv.

data = pd.read_csv("../input/TIMEGenderData.csv")

D. Analysing the data

Let's take a look at what the data looks like.

data.head()
   Year  Female  Male  Total  Female %  Male %
0  1923       1    34     35     2.86%  97.14%
1  1924       4    48     52     7.69%  92.31%
2  1925       1    51     52     1.92%  98.08%
3  1926       7    46     52    13.46%  88.46%
4  1927       4    49     52     7.69%  94.23%

We see from data.head() that data has 6 columns.

  • Year - The year of release.
  • Female - The number of female personalities in issues for that whole year.
  • Male - The number of male personalities in issues for that whole year.
  • Total - Total number of issues in the year.
  • Female % - Female/Total * 100%
  • Male % - Male/Total * 100%
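The column definitions suggest that Female + Male should equal Total for each year, and that is easy to check. The sketch below runs the check on a small hand-built frame holding only the five rows printed by data.head(); the name `sample` is my own, and the real notebook would run the same comparison on `data`. On these five rows, 1926 and 1927 do not add up (53 against a Total of 52). I have not verified why.

```python
import pandas as pd

# Stand-in for the real frame: the five rows printed by data.head() above
sample = pd.DataFrame({
    "Year": [1923, 1924, 1925, 1926, 1927],
    "Female": [1, 4, 1, 7, 4],
    "Male": [34, 48, 51, 46, 49],
    "Total": [35, 52, 52, 52, 52],
})

# Keep only the rows where the two counts do not add up to the stated total
mismatch = sample[sample["Female"] + sample["Male"] != sample["Total"]]
print(mismatch)
```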

Probing the data a little more,

data.describe()
             Year     Female       Male      Total
count    91.00000  91.000000  91.000000  91.000000
mean   1968.00000   5.835165  39.549451  45.362637
std      26.41338   3.163204   7.968359   6.433448
min    1923.00000   1.000000  20.000000  26.000000
25%    1945.50000   4.000000  33.000000  41.000000
50%    1968.00000   5.000000  41.000000  47.000000
75%    1990.50000   7.000000  46.000000  51.000000
max    2013.00000  18.000000  51.000000  52.000000

The count row in the table above is 91 for every numeric column, so none of these columns has null values.
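Since describe() only summarises the numeric columns, a more direct way to confirm this for every column, including the string percentage columns, is to count nulls explicitly. This is a minimal sketch on a hand-built frame with the first three rows shown earlier; `sample` and `null_counts` are names I made up for illustration, and in the notebook the same calls would be made on `data`.

```python
import pandas as pd

# Stand-in for the real frame: the first three rows printed by data.head()
sample = pd.DataFrame({
    "Year": [1923, 1924, 1925],
    "Female": [1, 4, 1],
    "Male": [34, 48, 51],
    "Total": [35, 52, 52],
    "Female %": ["2.86%", "7.69%", "1.92%"],
    "Male %": ["97.14%", "92.31%", "98.08%"],
})

# Number of null values in each column, numeric or not
null_counts = sample.isnull().sum()
print(null_counts)

# Single yes/no answer for the whole frame
print("Any nulls:", sample.isnull().values.any())
```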

And already, when we look at the max values in the description, the difference is… umm, let’s wait for that.

When we think of what we need to plot the data, i.e., how gender demography on the covers changed over the years, we can assume that we need the percentage values of both Female and Male covers. But we have two problems here.

  • At first glance the percentage values look like floats. Matplotlib can plot floats directly, but for the stacked bar graph I plan to draw, I want whole-number percentages.
  • If we check the type of the values in the percentage columns, we see
type(data['Female %'][0])
str

The percentage values are actually strings. So we cannot use them as they are; they first have to be converted to float and then to int.

A plain typecast from string to float and then to int would not work directly, though, because there is another catch: each percentage value ends with a % symbol.

To get around this problem, we can do one of two things.

  • Trim the symbol from the values and typecast.
  • Calculate the percentage from scratch.

I personally prefer calculating the percentage values from scratch, since the raw counts are already in the frame, so we will proceed with that here. The first approach works too.
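For reference, the first approach (trimming the symbol and typecasting) could look roughly like this. It is a minimal sketch on a hand-built frame holding the first three rows shown earlier, not the full dataset; the frame name `sample` and the new column names `FemalePct` and `MalePct` are my own, chosen for illustration.

```python
import pandas as pd

# Stand-in for the real frame: the first three rows printed by data.head()
sample = pd.DataFrame({
    "Year": [1923, 1924, 1925],
    "Female %": ["2.86%", "7.69%", "1.92%"],
    "Male %": ["97.14%", "92.31%", "98.08%"],
})

# Strip the trailing % sign, then cast the remaining text to float
sample["FemalePct"] = sample["Female %"].str.rstrip("%").astype(float)
sample["MalePct"] = sample["Male %"].str.rstrip("%").astype(float)

print(sample[["Year", "FemalePct", "MalePct"]])
```

One thing to keep in mind with this route is that it trusts the percentages stored in the file, whereas recomputing from the counts trusts the counts.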

E. Modifying the data

First we drop the columns from the frame which cannot be used.

data = data.drop(['Female %', 'Male %'], axis=1)

Now we calculate the percentage values by using data from the frame.

# Each expression returns a pandas Series, so no empty list needs to be created first
femaleperc = data.Female / data.Total * 100
maleperc = data.Male / data.Total * 100

We also convert the float values to int values before plotting, so that the chart works with whole-number percentages.

femaleperc = [int(x) for x in femaleperc]
maleperc = [int(x) for x in maleperc]

Now we add two more columns in the frame and assign these calculated percentage values to those columns.

data = data.assign(FemalePerc = femaleperc)
data = data.assign(MalePerc = maleperc)

Now, we take a look at data again, checking whether what we did worked and we can work with the data now.

data.head()
   Year  Female  Male  Total  FemalePerc  MalePerc
0  1923       1    34     35           2        97
1  1924       4    48     52           7        92
2  1925       1    51     52           1        98
3  1926       7    46     52          13        88
4  1927       4    49     52           7        94

Yay! Our problem is solved. We have the percentage values in int which we can now plot on a beautiful stacked bar chart.

Visualization of Data

The percentage values were converted with int(), which truncates the fractional part instead of rounding to the nearest whole number. For positive values like these, that is the same as flooring.

For example, 2.86 should have been rounded to 3 but has been floored to 2.

And because of this, the two percentages do not always sum to the same value. What we need for a clean stacked bar chart is a constant sum of the Male and Female percentages; otherwise some bars will be taller or shorter than others.
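The effect is easy to see on a single row. The sketch below uses the 1923 counts shown earlier (1 female, 34 male, 35 total) and compares int() with Python's built-in round(); the variable names are my own.

```python
# Counts for 1923, taken from the table shown earlier
female, male, total = 1, 34, 35

female_pct = female / total * 100   # about 2.86
male_pct = male / total * 100       # about 97.14

# int() drops the fractional part; round() goes to the nearest integer
floored = (int(female_pct), int(male_pct))
rounded = (round(female_pct), round(male_pct))

print(floored, sum(floored))   # (2, 97) 99
print(rounded, sum(rounded))   # (3, 97) 100
```

For this row, flooring leaves the pair one point short of 100, while rounding does not.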

To fix this issue, we use a simple trick. We find the rows whose MalePerc and FemalePerc do not sum to 100 (the sum of the two percentages should always be 100) and adjust MalePerc so that the sum equals 100. This is a workaround for not having rounded the values during conversion. Note that flooring alone can only push the sum below 100; a sum above 100 means Female + Male exceeds Total in the source data, as in the 1926 row shown earlier (7 + 46 = 53 against a Total of 52).

for i, row in data.iterrows():
    total = row.FemalePerc + row.MalePerc  # named total to avoid shadowing the built-in sum()
    if total > 100:  # Check whether the sum is above 100
        diff = total - 100  # Find out the difference
        # We modify MalePerc to absorb the difference; .loc writes to the frame itself
        data.loc[i, 'MalePerc'] = row.MalePerc - diff
    elif total < 100:  # Check whether the sum is less than 100
        diff = 100 - total
        data.loc[i, 'MalePerc'] = row.MalePerc + diff
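The same adjustment can also be written without an explicit loop. Because the loop always moves MalePerc by exactly the shortfall or excess, the result is equivalent to setting MalePerc to 100 minus FemalePerc. The sketch below shows this on a hand-built frame with the five rows printed earlier; `sample` is a name I made up, and this is an alternative formulation, not the code used in the notebook.

```python
import pandas as pd

# Stand-in for the real frame: the five rows printed after adding the new columns
sample = pd.DataFrame({
    "Year": [1923, 1924, 1925, 1926, 1927],
    "FemalePerc": [2, 7, 1, 13, 7],
    "MalePerc": [97, 92, 98, 88, 94],
})

# Vectorised version of the loop: make every pair sum to exactly 100
sample["MalePerc"] = 100 - sample["FemalePerc"]

# Vectorised version of the check: True only if every row sums to 100
all_ok = bool((sample["FemalePerc"] + sample["MalePerc"] == 100).all())
print(sample)
print("All rows sum to 100:", all_ok)
```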

Now we check that every row's percentages sum to 100 after the adjustment:

for i, row in data.iterrows():
    total = row.FemalePerc + row.MalePerc
    if total != 100:
        print("Error")  # Print Error if any row's percentages do not sum to 100.

We run the cell above and get no Error message. Hence, we are good to go!

Now finally we can proceed to making the plot.

A. Plotting on a Stacked Bar Graph

#Plotting the data using a Stacked Bar Graph
plt.figure(figsize=(25,15)) #Setting the figure size 
barWidth = 0.9 #Setting width of each bar
x_values = data.Year #For setting the x-axis values as the Years of the publications
plt.bar(x_values, data.FemalePerc, color='#b5ffb9', edgecolor='white', width=barWidth, label='Female')
plt.bar(x_values, data.MalePerc, bottom=data.FemalePerc, color='#f9bc86', edgecolor='white', width=barWidth, label='Male')
plt.xticks(x_values, rotation=90, fontsize=15)
plt.yticks(fontsize=18)
plt.legend(bbox_to_anchor=(1,1), loc=2, prop={'size':15})
#bbox_to_anchor places the legend outside the graph. The anchor is given in axes coordinates rather than data coordinates:
#(0,0) is the lower-left corner of the chart and (1,1) is the upper-right corner. loc=2 means 'upper left', so the upper-left
#corner of the legend box sits at (1,1), just outside the upper-right corner of the chart. prop={'size':15} sets the legend font size.
plt.xlabel('Years', fontsize=20)
plt.ylabel('Percentage', fontsize=20, rotation=90)
plt.title('Analysis of male and female personalities on covers of TIME (1923-2013)', fontsize = 25)
plt.show()


[Figure: stacked bar graph of female and male percentages on covers of TIME, 1923-2013]

B. Plotting on a Scatter Plot with a Regression Line

import seaborn as sns
plt.figure(figsize=(22,13))
sns.set(color_codes = True)
sns.set_style("darkgrid")
ax = sns.regplot(x="Year", y="MalePerc", data = data, color='#FF7F50', label='Male' )
ax1 = sns.regplot(x="Year", y="FemalePerc", data = data, color='#008000', label='Female')
plt.xlabel('Year', fontsize = 15)
plt.ylabel('Percentage', fontsize = 15)
plt.title('Trend of male and female covers on Time (1923-2013)', fontsize = 20)
plt.legend(bbox_to_anchor=(1,1), loc=2, prop={'size':15})
plt.show()


[Figure: scatter plot with regression lines for male and female cover percentages on Time, 1923-2013]
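To put a number on the slope of a trend line, one option is an ordinary least-squares fit of a straight line with numpy. This is only a sketch of the idea: it uses the five FemalePerc values printed earlier, so the resulting slope says nothing about the full 1923-2013 trend. On the real frame the inputs would be data.Year and data.FemalePerc (or data.MalePerc). The variable names are my own.

```python
import numpy as np

# Illustrative values only: the five rows printed earlier, not the full dataset
years = np.array([1923, 1924, 1925, 1926, 1927])
female_perc = np.array([2, 7, 1, 13, 7])

# A degree-1 polynomial fit is a least-squares straight line;
# np.polyfit returns the coefficients highest power first: [slope, intercept]
slope, intercept = np.polyfit(years, female_perc, 1)

print(f"Slope: {slope:.2f} percentage points per year")
```

A positive slope means the percentage rises over the years, and a negative slope means it falls.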

Conclusion

  • From the stacked bar graph: The stacked bar graph above shows that, over a period of 90 years, the covers of Time have consistently favoured men over women by a huge margin, especially during the early years and in some years of the 1940s.

  • From the scatter plot: It is a small consolation that the long-standing imbalance has eased over time, as the negatively sloped line for males and the positively sloped line for females show. The improvement is most visible during the 1970s and continuously from the 1980s, with a dip in 1995. The share of women peaked in 2005, the highest since the beginning, and then declined steadily. It is for the authorities at Time to explain what happened after 2005 and in the years with the lowest percentages: 1925, 1942 and 1944.