Revisiting Python Data Analysis: Part 2

Drug Line Chart

You may find the dataset for this project here. The data is titled Chemical Dependence Treatment Program Admissions: Beginning 2007. It was last updated March 4, 2015.

From the source:

NYS Office of Alcoholism and Substance Abuse Services (OASAS) certified chemical dependence treatment programs report admissions of people served in programs throughout NYS. This dataset includes the number of admissions to NYS OASAS certified treatment programs aggregated by the program category, county of the program location, age group of client at admission, and the primary substance of abuse group.

Background

I was having a lot of fun visualizing data scraped from HTML tables, but had yet to really dive into a serious dataset. I was looking for a lot of data, and thought data.gov would have a lot of cool datasets to analyze. After doing some searching around, I came across the dataset used in this project. Given my background in psychology, particularly my work as a research assistant for a clinical psychology lab, this dataset was both familiar and interesting to me. Also, around the time of this project I was volunteering for one of the country's largest detox centers, so it was a rewarding experience to be doing both amateur research and clinical work related to substance abuse — I was essentially a full-stack psychologist.

Anyway, let's review how I made the graph pictured above.

(WARNING: I created this graph early in my coding journey. Some of the code may be a bit rough, but I will point out where I would have done things differently.)

Imports

First, we will import the usual suspects:

import pandas as pd
import seaborn as sb
import matplotlib as mpl
import matplotlib.pyplot as plt

Python and Excel

The data came in a .csv file, but I converted it to .xlsx for reasons I can't quite recall. Either way, let's open the Excel file:

xlsfile = pd.ExcelFile(r'../substance_copy.xlsx')

Then let's get the sheet of interest, in this case the sheet named main:

dframe = xlsfile.parse('main')
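Since the data originally came as a .csv file, the Excel round trip could probably be skipped. Here is a minimal sketch, assuming the CSV carries the same columns used later in this post (Admissions, age_num, Substance). The sample rows and substance names are invented for illustration and stand in for the real file:

```python
import io

import pandas as pd

# Invented sample rows standing in for the real CSV file (assumption).
sample_csv = io.StringIO(
    "Substance,age_num,Admissions\n"
    "Alcohol,0,12\n"
    "Heroin,1,30\n"
    "Alcohol,1,25\n"
)

# read_csv accepts a file path or any file-like object.
dframe = pd.read_csv(sample_csv)
print(dframe.head())
```

With the real data, the StringIO object would simply be replaced by the path to the downloaded .csv file.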

Some ugly stuff

When I created these graphs, I was still quite a novice and very much in a "just get it done" mindset. As a result, to achieve the desired aesthetic for the graph I modified the matplotlib rcParams directly. These are global to the matplotlib package, so the change affects every plot created afterwards.

# This is bad because `rcParams` are global!
mpl.rcParams['patch.force_edgecolor'] = True
plt.rcParams['axes.facecolor'] = '#3b3b49'

A more appropriate solution is to use rc_context, a context manager that applies the settings only inside a with block, as noted in this Stack Overflow post.
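A minimal sketch of that approach, reusing the two settings from above. The toy data and the variable names (overrides, before, inside, after) are just for illustration:

```python
import matplotlib as mpl

mpl.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

overrides = {
    "patch.force_edgecolor": True,
    "axes.facecolor": "#3b3b49",
}

before = mpl.rcParams["axes.facecolor"]

# The overrides apply only inside the with block.
with mpl.rc_context(overrides):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [1, 3, 2])  # toy data
    inside = mpl.rcParams["axes.facecolor"]

# On exit, the previous global settings are restored.
after = mpl.rcParams["axes.facecolor"]
print(before, inside, after)
```

The axes created inside the block keep the dark background, while plots made later in the same session fall back to the previous defaults.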

Seaborn to the rescue

As I said, this is one of my first programming projects, so please excuse the hard-coded color values. Seaborn makes your graphs look nice and has an intuitive API, as seen below:

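# Note: since seaborn 0.9, factorplot has been renamed to catplot
# (pass kind="point" to keep the point plot) and size has been renamed to height.
age_graph = sb.factorplot(
    y="Admissions", x="age_num", data=dframe,
    size=7, aspect=1.6, capsize=0.1,
    hue="Substance", legend=None,
    palette=sb.color_palette(['#4286f4',
                              '#f4d442',
                              '#cb42f4',
                              '#42f498',
                              '#f4426b'])
)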

To get a cleaner aesthetic, I removed the spines by calling despine and passing it left=True, since the default arg for despine is left=False:

# Despine defaults:
# seaborn.despine(fig=None, ax=None, top=True, right=True, left=False, bottom=False, offset=None, trim=False)

sb.despine(left=True)

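Then to get that nice grid overlay, we call set_style and tell it we want a "whitegrid". Note that set_style changes the defaults for plots created after the call, so it normally goes before the plotting call: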

sb.set_style("whitegrid")

Fine tune with `pyplot`

To customize the plot even further, we'll access the plot directly. Of course, there may be a way to add this customization with Seaborn, but when I made this in 2017, accessing the plot directly seemed easier. I won't add too much commentary on the following since it's pretty self-explanatory.

Set the title

You'll note the newline escape sequence, \n, which ensures the title fits on the graph:

plt.title(
    "NY State Chemical Dependence\nTreatment Program Admissions 2007-2015", 
    fontsize=24, 
    color="black", 
    fontweight="heavy", 
    y=1.04
)

Set the legend

plt.legend(
    bbox_to_anchor=(.01, 0.98), loc='upper left',
    ncol=1, fontsize=14,
    frameon=True, shadow=True
)

Set the labels

plt.ylabel("Annual average\nper county", fontweight="bold", fontsize=20)
plt.xlabel("Age", fontweight="bold", fontsize=22)

Set the axis ticks

plt.xticks([0, 1, 2, 3, 4, 5],
           ['<18', '18-24', '25-34', '35-44', '45-54', '55+'],
           fontsize=16, fontweight='bold')
plt.yticks(fontsize=16, fontweight='bold')
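One way to keep the tick positions and labels from drifting apart is to derive both from a single list. This is only a sketch: the names AGE_LABELS, tick_positions, and age_to_num are hypothetical, and whether the age_num column was originally built from a lookup like this is an assumption.

```python
# Single source of truth for the age groups, in plotting order.
AGE_LABELS = ['<18', '18-24', '25-34', '35-44', '45-54', '55+']

# Tick positions 0..5 derived from the labels instead of typed by hand.
tick_positions = list(range(len(AGE_LABELS)))

# Hypothetical lookup from an age-group label to its numeric x position.
age_to_num = {label: pos for pos, label in enumerate(AGE_LABELS)}

print(tick_positions)
print(age_to_num['25-34'])
```

The call plt.xticks(tick_positions, AGE_LABELS, fontsize=16, fontweight='bold') would then replace the two hand-written lists.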

Set custom text

I will use plt.text to show how many total admissions there are for this dataset, essentially our sample size.

First, we will get the total number of admissions as an integer; the "," format specifier then inserts the thousands separators for us:

total_admissions = int(dframe['Admissions'].sum())
plt.text(
    -0.23,
    115,
    "Total Admissions = {:,}".format(total_admissions),
    fontsize=14,
    color="white",
    fontweight="medium",
    fontstyle="italic"
)
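As a small sketch of how a total can be comma-grouped regardless of how many digits it has, Python's format specification supports a "," option. The helper name format_total and the sample numbers below are made up for illustration and are not taken from the dataset:

```python
def format_total(total):
    """Return the label text with comma thousands separators."""
    return "Total Admissions = {:,}".format(int(total))


print(format_total(1234567))  # Total Admissions = 1,234,567
print(format_total(98765))    # Total Admissions = 98,765
```

Because the grouping is handled by the format specifier, the label stays correct if the dataset is updated and the total gains or loses a digit.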

Save the graph

Ah yes, so after all that we may now save the graph:

age_graph.savefig('agegraph.png')

Conclusion

This was a nice trip back to one of my first programming projects. Admittedly the code wasn't the greatest, but I am happy with how the graph came out. Visualizing data with Python will always have a special place in my heart.