You may find the dataset for this project here. The data is titled Chemical Dependence Treatment Program Admissions: Beginning 2007. It was last updated March 4, 2015.
From the source:
NYS Office of Alcoholism and Substance Abuse Services (OASAS) certified chemical dependence treatment programs report admissions of people served in programs throughout NYS. This dataset includes the number of admissions to NYS OASAS certified treatment programs aggregated by the program category, county of the program location, age group of client at admission, and the primary substance of abuse group.
Background
I was having a lot of fun visualizing data scraped from HTML tables, but had yet to really dive into a serious dataset. I was looking for a lot of data, and thought data.gov would have a lot of cool datasets to analyze. After doing some searching around, I came across the dataset used in this project. Given my background in psychology, particularly my work as a research assistant for a clinical psychology lab, this dataset was both familiar and interesting to me. Also, around the time of this project I was volunteering for one of the country's largest detox centers, so it was a rewarding experience to be doing both amateur research and clinical work related to substance abuse — I was essentially a full-stack psychologist.
Anyway, let's review how I made the graph pictured above.
(WARNING: I created this graph early in my coding journey. Some of the code may be a bit rough, but I will point out where I would have done things differently.)
Imports
First, we will import the usual suspects:
import pandas as pd
import seaborn as sb
import matplotlib as mpl
import matplotlib.pyplot as plt
Python and Excel
The data came in a .csv file, but I converted it to .xlsx for reasons I can't quite recall. Either way, let's open the Excel file:
xlsfile = pd.ExcelFile(r'../substance_copy.xlsx')
Then let's get the sheet of interest, in this case the sheet named main:
dframe = xlsfile.parse('main')
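Since the data originally came as a .csv, the Excel conversion could probably be skipped. Here is a minimal sketch of reading CSV text straight into a DataFrame with pandas.read_csv. The rows are made up for illustration, and the column names are only assumed to match the ones used later in this post:

```python
import io

import pandas as pd

# Made-up rows for illustration only; the column names follow the ones
# referenced later in this post (Substance, age_num, Admissions).
csv_text = """Substance,age_num,Admissions
Alcohol,0,12
Alcohol,1,85
Heroin,1,97
"""

# With the real file, pass the path to the .csv instead of a StringIO buffer.
dframe = pd.read_csv(io.StringIO(csv_text))
total_rows = len(dframe)
```

This avoids the extra ExcelFile and parse steps, since read_csv returns the DataFrame directly.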
Some ugly stuff
When I created these graphs, I was still quite a novice and was very much in a "just get it done" mindset. That said, to achieve the desired aesthetic for the graph I modified the matplotlib rcParams, which are global to the matplotlib package.
# This is bad because `rcParams` are global!
mpl.rcParams['patch.force_edgecolor'] = True
plt.rcParams['axes.facecolor'] = '#3b3b49'
A more appropriate solution is to use matplotlib's rc_context, as noted in this Stack Overflow post.
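A minimal sketch of the scoped approach, using the same two settings as above. The plotted data is a placeholder and the Agg backend is chosen only so the snippet runs without a display:

```python
import matplotlib as mpl

mpl.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# The two settings from this post, applied only inside the `with` block.
dark_style = {
    "patch.force_edgecolor": True,
    "axes.facecolor": "#3b3b49",
}

default_facecolor = mpl.rcParams["axes.facecolor"]

with mpl.rc_context(dark_style):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2], [0, 1, 4])  # placeholder data, just to draw something
    facecolor_inside = mpl.rcParams["axes.facecolor"]

# Outside the block the global settings are back to what they were.
facecolor_after = mpl.rcParams["axes.facecolor"]
plt.close(fig)
```

Any figure created inside the block picks up the dark background, while plots made elsewhere in the same session keep the defaults.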
Seaborn to the rescue
As I said, this is one of my first programming projects, so please excuse the hard-coded color values. Seaborn makes your graphs look nice and has an intuitive API, as seen below:
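# Note: newer Seaborn releases renamed factorplot to catplot (pass kind="point")
# and renamed the size argument to height.
age_graph = sb.factorplot(
    y="Admissions", x="age_num", data=dframe,
    size=7, aspect=1.6, capsize=0.1,
    hue="Substance", legend=None,
    palette=sb.color_palette(['#4286f4',
                              '#f4d442',
                              '#cb42f4',
                              '#42f498',
                              '#f4426b'])
)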
To get a cleaner aesthetic, I removed the spines by calling despine and passing it left=True, since the default arg for despine is left=False:
# Despine defaults:
# seaborn.despine(fig=None, ax=None, top=True, right=True, left=False, bottom=False, offset=None, trim=False)
sb.despine(left=True)
Then to get that nice grid overlay, we call set_style and tell it we want a "whitegrid":
sb.set_style("whitegrid")
Fine tune with pyplot
To customize the plot even further, we'll access the plot directly. Of course, there may be a way to add this customization with Seaborn, but when I made this in 2017, accessing the plot directly seemed easier. I won't add too much commentary on the following since it's pretty self-explanatory.
Set the title
You'll note an escape character, \n, to ensure the title fits on the graph:
plt.title(
    "NY State Chemical Dependence\nTreatment Program Admissions 2007-2015",
    fontsize=24,
    color="black",
    fontweight="heavy",
    y=1.04
)
Set the legend
plt.legend(
    bbox_to_anchor=(.01, 0.98), loc='upper left',
    ncol=1, fontsize=14,
    frameon=True, shadow=True
)
Set the labels
plt.ylabel("Annual average\nper county", fontweight="bold", fontsize=20)
plt.xlabel("Age", fontweight="bold", fontsize=22)
Set the axis ticks
plt.xticks([0, 1, 2, 3, 4, 5],
           ['<18', '18-24', '25-34', '35-44', '45-54', '55+'],
           fontsize=16, fontweight='bold')
plt.yticks(fontsize=16, fontweight='bold')
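The tick positions above are hard-coded to line up with the age_num column. As a hedged sketch, the positions and a label-to-number mapping could instead be derived from a single list of labels. The names AGE_LABELS, age_to_num, and positions are hypothetical, and the assumption is that age_num numbers the age groups 0 through 5 in this order:

```python
# Labels copied from the xticks call; their order defines the numeric code.
AGE_LABELS = ['<18', '18-24', '25-34', '35-44', '45-54', '55+']

# Map each label to its position, e.g. for building a column like age_num.
age_to_num = {label: num for num, label in enumerate(AGE_LABELS)}

# Tick positions that stay in sync with the labels if the list ever changes.
positions = list(range(len(AGE_LABELS)))
```

Passing positions and AGE_LABELS to plt.xticks would then keep the ticks and the data encoding tied to one definition.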
Set custom text
I will use plt.text to show how many total admissions there are for this dataset, essentially our sample size.
First, we will get the total number of admissions and cast it as a string to make formatting a bit easier:
admissions = str(sum(dframe['Admissions']))
plt.text(
    -0.23,
    115,
    # Slice the digits into comma-separated groups; assumes a seven-digit total
    "Total Admissions = {0},{1},{2}".format(admissions[:1],
                                            admissions[1:4],
                                            admissions[4:7]),
    fontsize=14,
    color="white",
    fontweight="medium",
    fontstyle="italic"
)
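Slicing the string by hand only works for a total with a known number of digits. A sketch of an alternative is Python's built-in thousands-separator format specifier, which handles any length. The total below is a placeholder, not the dataset's real figure:

```python
# Placeholder value for illustration; with the real data this would be
# something like int(dframe['Admissions'].sum()).
total = 1234567

# The "," format specifier inserts thousands separators automatically.
label = "Total Admissions = {:,}".format(total)
```

The resulting label string can be passed to plt.text in place of the manually sliced version.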
Save the graph
Ah yes, so after all that we may now save the graph:
age_graph.savefig('agegraph.png')
Conclusion
This was a nice trip back to one of my first programming projects. Admittedly the code wasn't the greatest, but I am happy with how the graph came out. Visualizing data with Python will always have a special place in my heart.