For this analysis, we have access to 2 months of real crime data for Summer 2014 in San Francisco and Seattle.
The goal of this study is to explore these cities' nightlife in a criminology context, to figure out which crimes are more frequent at night, which neighborhoods are most impacted, and how both cities compare.
Key findings: Throughout this analysis, we establish the following facts, which will be demonstrated via 4 different visualizations:
I chose IPython notebook to perform this analysis for several reasons:
YEAR = 2014
PERIOD = "summer"
TYPE = "incidents"
DATA_DIR = "data"
SF = "sanfrancisco"
SEATTLE = "seattle"
To do this analysis, we are using Python's excellent pandas library, which lets us represent the data as DataFrames and makes subsequent analysis much easier than doing it by hand.
Below we are creating 2 DataFrames:

- sf: contains San Francisco's crime data for Summer 2014.
- seattle: contains Seattle's crime data for Summer 2014.

# Load data into pandas DataFrames
import os
import pandas as pd
sf_filename = os.path.join(DATA_DIR, "%s_%s_%s_%d.csv" % (SF, TYPE, PERIOD, YEAR))
sf = pd.read_csv(sf_filename)
sf_nrows = sf.shape[0]
print "Read SF data from %s yielded %d rows" % (sf_filename, sf_nrows)
seattle_filename = os.path.join(DATA_DIR, "%s_%s_%s_%d.csv" % (SEATTLE, TYPE, PERIOD, YEAR))
seattle = pd.read_csv(seattle_filename)
seattle_nrows = seattle.shape[0]
print "Read Seattle data from %s yielded %d rows" % (seattle_filename, seattle_nrows)
assert sf_nrows == len(open(sf_filename).readlines()) - 1 != 0
assert seattle_nrows == len(open(seattle_filename).readlines()) - 1 != 0
As with any data analysis task, we start by sampling some random data in sf just to look at it and get a feeling for what we can get out of it.
For example, we can see that some columns are not really useful for what we are trying to accomplish, like IncidntNum or PdId, which are just SFPD internal IDs. On the other hand, some data like X or Y will be useful for any kind of geospatial analysis, as will Date and Time for temporal analysis.
sf.sample(n=5)
We do the same for seattle, and there are 2 important things to notice here:

- Like in sf, several columns are not useful and can be discarded.
- Some columns contain the same data as in sf but are called differently, like Longitude and Latitude which correspond respectively to X and Y in sf. The schema therefore does not match sf, and in this state making comparisons between the 2 cities will be hard.

seattle.sample(n=5)
At that point it becomes clear that, before we can do any kind of meaningful data exploration and analysis, we need to pre-process the data. In particular, the following tasks will be required:
There are several columns in sf that have data mapping to other columns in seattle, but they are called differently. Specifically:

- X in sf has a 1:1 relationship to Longitude in seattle.
- Y in sf has a 1:1 relationship to Latitude in seattle.
- PdDistrict in sf has a 1:1 relationship to District/Sector in seattle. To simplify we want to name both District.
- Descript in sf has a 1:1 relationship to Summarized Offense Description in seattle. To simplify we want to name both Description.

So in a first step, we will modify sf to rename these columns in place.
sf.rename(columns = {"X": "Longitude", "Y": "Latitude", "PdDistrict": "District", "Descript": "Description"},
inplace=True)
Next we want to take the existing columns, process them, and extract new features which may or may not be useful during our data exploration process later on. In particular:
- Since we want the data from sf and seattle normalized so they can fit in a single unified DataFrame, we need to indicate for each row the City, so that we can later on control which city we're processing.
- We have access to the Date field in format mm/dd/yyyy, but if we're doing temporal analysis it would be useful to have access to individual units separately, so we'd want one column for the DayOfMonth, another for the Month, and a last one for the Year.
.sf["City"] = pd.Series([SF]*sf.shape[0], index=sf.index)
sf["DayOfMonth"] = sf.Date.map(lambda x: int(x.split("/")[1]))
sf["Month"] = sf.Date.map(lambda x: int(x.split("/")[0]))
sf["Year"] = sf.Date.map(lambda x: int(x.split("/")[2]))
Another feature we want to extract is the TimeBin. We currently have access to the time of each incident under Time in format HH:MM, but if we want to perform time-based aggregation this is not very convenient.
Instead, we just keep the hour of the incident and store it in TimeBin. That way we have an integer representation of the time which is easier to visualize and work with, and we can aggregate hourly data very easily.
# Function that takes a Time string as "HH:MM" and returns an int representing the hour
def time_bin(t):
    hour, _minutes = t.split(":")
    return int(hour)
sf["TimeBin"] = sf.Time.map(time_bin)
Now that we have all our new features, we just need to drop the columns that are not useful for us. We will drop the following columns:

- IncidntNum: An SFPD internal ID which we can't make use of.
- PdId: Another SFPD internal ID.
- Location: This is just a combination of columns X and Y and is redundant.
- Resolution: While this tells an interesting story (how the case concluded), there is no such data in seattle, so we will not use it.

sf.drop(["IncidntNum", "PdId", "Location", "Resolution"], axis=1, inplace=True, errors="ignore")
We now have everything we need in sf; we simply need to create a new DataFrame sf_normalized with a specific column order. This is because seattle has a completely different column order, so we want to make sure both DataFrames have the exact same schema, so that we can ultimately fit them into a single DataFrame.
# Reindex the DataFrame
columns_index = ["City", "DayOfMonth", "DayOfWeek", "Month", "Year", "Time", "TimeBin",
"Longitude", "Latitude", "Address", "District",
"Category", "Description"
]
sf_normalized = sf[columns_index]
sf_normalized.sample(n=5)
We need to go through the same process for seattle. As a first step, let's drop the following columns:

- RMS CDW ID: A SeattlePD internal ID, of no use for us.
- General Offense Number: A unique number for each incident. Seems to be a function of the date, but we don't need it.
- Offense Code: A unique number identifying each type of offense. This maps to Offense Type but is less expressive, so we get rid of it.
- Offense Code Extension: Some ID, probably SeattlePD internal stuff.
- Census Tract 2000: Census information which is not relevant for our analysis.
- Location: This is just a combination of columns Longitude and Latitude and is redundant.

seattle_drop_columns = ["RMS CDW ID",
"General Offense Number",
"Offense Code",
"Offense Code Extension",
"Summary Offense Code",
"Census Tract 2000",
"Location"
]
seattle.drop(seattle_drop_columns, axis=1, inplace=True, errors="ignore")
Next we rename the columns which have a different name in sf:

- Hundred Block Location in seattle is similar to Address in sf.
- Offense Type in seattle is similar to Category in sf.
- Summarized Offense Description is renamed to Description to make it simpler, as mentioned earlier.
- District/Sector is renamed to District to make it simpler, as mentioned earlier.

seattle.rename(columns = {"Hundred Block Location": "Address", "Offense Type": "Category",
"Summarized Offense Description": "Description",
"District/Sector": "District"},
inplace=True)
Finally we want to create new features for our analysis, similarly to what we did for sf:

- City should always be seattle.
- For temporal data, we can choose between the alleged occurrence date (Occurred Date or Date Range Start) and the reported date (Date Reported). By looking at the data we see the reported date is sometimes many months after the alleged occurrence date, so we use the start of the alleged occurrence range for any temporal measure. From that we will create the new features DayOfMonth, DayOfWeek, Time and TimeBin to match the sf schema.

import datetime
seattle["City"] = pd.Series([SEATTLE]*seattle.shape[0], index=seattle.index)
seattle["DayOfMonth"] = seattle["Occurred Date or Date Range Start"].map(lambda x: int(x.split(" ")[0].split("/")[1]))
seattle["DayOfWeek"] = seattle["Occurred Date or Date Range Start"] \
.map(lambda x: datetime.datetime.strptime(x.split(" ")[0], "%m/%d/%Y").strftime("%A"))
seattle["Time"] = seattle["Occurred Date or Date Range Start"] \
.map(lambda x: datetime.datetime.strptime(x[11:], "%I:%M:%S %p").strftime("%H:%M"))
seattle["TimeBin"] = seattle.Time.map(time_bin)
Now we just need to reindex our DataFrame under a new one called seattle_normalized.
# Reindex the DataFrame
seattle_normalized = seattle[columns_index]
seattle_normalized.sample(n=5)
At that point, we have 2 DataFrames, sf_normalized and seattle_normalized, with matching schemas, but we're not done yet.
A particularly important column for analyzing the distribution of crime data is the Category. Unfortunately for us, the Seattle and San Francisco police departments have very different ways of labelling incidents, which don't match at all. For example, offenses related to drug trafficking are labelled DRUG/NARCOTIC in San Francisco, but in Seattle this is much more granular and there are more than 20 different labels for drug trafficking offenses. The problem is that if the granularity differs this much between the 2 cities, aggregated comparisons of crime categories won't mean anything.
As a result, we need to manually go through the crime categories and see how they map between the 2 cities. Since San Francisco categories are fewer and cover more ground, we will use San Francisco categories as the baseline, and try to map all of Seattle's crime categories into San Francisco categories namespace.
For example, we can say that Seattle's ASSLT-AGG-WEAPON matches San Francisco's ASSAULT category.
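To get a feel for this granularity gap, here is a quick illustrative count (an extra check, not required for the analysis) of how many distinct Seattle labels start with the NARC- prefix and will all collapse into SF's single DRUG/NARCOTIC category:
# Illustrative: count Seattle's drug-related labels before remapping
narc_labels = [c for c in set(seattle_normalized["Category"].tolist()) if c.startswith("NARC-")]
print "Seattle has %d distinct NARC-* labels that will map to DRUG/NARCOTIC" % len(narc_labels)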
category_subset = {
"RECKLESS BURNING": "ARSON",
"THEFT-OTH": "LARCENY/THEFT",
"BURGLARY-FORCE-NONRES": "BURGLARY",
"INJURY - ACCIDENTAL": "OTHER OFFENSES",
"ANIMAL-BITE": "OTHER OFFENSES",
"ANIMAL-CRUELTY": "OTHER OFFENSES",
"ANIMAL-OTH": "OTHER OFFENSES",
"ASSLT-AGG-BODYFORCE": "ASSAULT",
"ASSLT-AGG-GUN": "ASSAULT",
"ASSLT-AGG-POLICE-BODYFORCE": "ASSAULT",
"ASSLT-AGG-POLICE-GUN": "ASSAULT",
"ASSLT-AGG-POLICE-WEAPON": "ASSAULT",
"ASSLT-AGG-WEAPON": "ASSAULT",
"ASSLT-NONAGG": "ASSAULT",
"ASSLT-NONAGG-POLICE": "ASSAULT",
"BIAS INCIDENT": "NON-CRIMINAL",
"BURGLARY-FORCE-RES": "BURGLARY",
"BURGLARY-NOFORCE-NONRES": "BURGLARY",
"BURGLARY-NOFORCE-RES": "BURGLARY",
"BURGLARY-SECURE PARKING-NONRES": "BURGLARY",
"BURGLARY-SECURE PARKING-RES": "BURGLARY",
"COUNTERFEIT": "FORGERY/COUNTERFEITING",
"DISPUTE-CIVIL PROPERTY (AUTO)": "NON-CRIMINAL",
"DISPUTE-CIVIL PROPERTY (NON AU": "NON-CRIMINAL",
"DISPUTE-OTH": "NON-CRIMINAL",
"DISTURBANCE-NOISE": "NON-CRIMINAL",
"DISTURBANCE-OTH": "NON-CRIMINAL",
"DRIVE-BY": "OTHER OFFENSES",
"DUI-DRUGS": "DRIVING UNDER THE INFLUENCE",
"DUI-LIQUOR": "DRIVING UNDER THE INFLUENCE",
"ELUDING-FELONY FLIGHT": "RUNAWAY",
"EMBEZZLE": "EMBEZZLEMENT",
"ENDANGERMENT": "OTHER OFFENSES",
"ESCAPE": "RUNAWAY",
"FALSE REPORT": "OTHER OFFENSES",
"FIREWORK-POSSESS": "OTHER OFFENSES",
"FIREWORK-USE": "OTHER OFFENSES",
"FORGERY-CHECK": "FORGERY/COUNTERFEITING",
"FORGERY-CREDIT CARD": "FORGERY/COUNTERFEITING",
"FORGERY-OTH": "FORGERY/COUNTERFEITING",
"FRAUD-CHECK": "FRAUD",
"FRAUD-COMPUTER": "FRAUD",
"FRAUD-CREDIT CARD": "FRAUD",
"FRAUD-IDENTITY THEFT": "FRAUD",
"FRAUD-OTHER": "FRAUD",
"FRAUD-WIRE-ELECTRONIC": "FRAUD",
"HARASSMENT": "DISORDERLY CONDUCT",
"HOMICIDE-JUST-GUN": "ASSAULT",
"HOMICIDE-JUST-WEAPON": "ASSAULT",
"HOMICIDE-PREMEDITATED-GUN": "ASSAULT",
"ILLEGAL DUMPING": "NON-CRIMINAL",
"INJURY - OTHER": "OTHER OFFENSES",
"LIQUOR LAW VIOLATION": "LIQUOR LAWS",
"MALICIOUS HARASSMENT": "DISORDERLY CONDUCT",
"NARC-DRUG TRAFFIC LOITERING": "DRUG/NARCOTIC",
"NARC-EQUIPMENT/PARAPHENALIA": "DRUG/NARCOTIC",
"NARC-FORGERY-PRESCRIPTION": "DRUG/NARCOTIC",
"NARC-FOUND-AMPHETAMINE": "DRUG/NARCOTIC",
"NARC-FOUND-COCAINE": "DRUG/NARCOTIC",
"NARC-FOUND-HEROIN": "DRUG/NARCOTIC",
"NARC-FOUND-MARIJU": "DRUG/NARCOTIC",
"NARC-FOUND-METH": "DRUG/NARCOTIC",
"NARC-FOUND-OPIUM": "DRUG/NARCOTIC",
"NARC-FOUND-OTHER": "DRUG/NARCOTIC",
"NARC-FOUND-SYNTHETIC": "DRUG/NARCOTIC",
"NARC-FRAUD-PRESCRIPTION": "DRUG/NARCOTIC",
"NARC-POSSESS-AMPHETAMINE": "DRUG/NARCOTIC",
"NARC-POSSESS-COCAINE": "DRUG/NARCOTIC",
"NARC-POSSESS-HALLUCINOGEN": "DRUG/NARCOTIC",
"NARC-POSSESS-HEROIN": "DRUG/NARCOTIC",
"NARC-POSSESS-MARIJU": "DRUG/NARCOTIC",
"NARC-POSSESS-METH": "DRUG/NARCOTIC",
"NARC-POSSESS-OTHER": "DRUG/NARCOTIC",
"NARC-POSSESS-PILL/TABLET": "DRUG/NARCOTIC",
"NARC-PRODUCE-MARIJU": "DRUG/NARCOTIC",
"NARC-SELL-AMPHETAMINE": "DRUG/NARCOTIC",
"NARC-SELL-COCAINE": "DRUG/NARCOTIC",
"NARC-SELL-HEROIN": "DRUG/NARCOTIC",
"NARC-SELL-MARIJU": "DRUG/NARCOTIC",
"NARC-SELL-METH": "DRUG/NARCOTIC",
"NARC-SELL-SYNTHETIC": "DRUG/NARCOTIC",
"NARC-SMUGGLE-OTHER": "DRUG/NARCOTIC",
"OBSTRUCT": "OTHER OFFENSES",
"PORNOGRAPHY-OBSCENE MATERIAL": "PORNOGRAPHY/OBSCENE MAT",
"PROP RECOVERED-OTHER AGENCY": "STOLEN PROPERTY",
"PROPERTY DAMAGE - GRAFFITI": "VANDALISM",
"PROPERTY DAMAGE-NON RESIDENTIA": "VANDALISM",
"PROPERTY DAMAGE-RESIDENTIAL": "VANDALISM",
"PROPERTY FOUND": "STOLEN PROPERTY",
"PROPERTY LOST": "STOLEN PROPERTY",
"PROPERTY LOST - POLICE EQUIPME": "STOLEN PROPERTY",
"PROPERTY STOLEN-POSSESS": "STOLEN PROPERTY",
"PROPERTY STOLEN-SELL": "STOLEN PROPERTY",
"PROPERTY STOLEN-TRAFFICKING": "STOLEN PROPERTY",
"PROSTITUTION LOITERING": "PROSTITUTION",
"PROSTITUTION PATRONIZING": "PROSTITUTION",
"PROSTITUTION-ASSIST-PROMOTE": "PROSTITUTION",
"ROBBERY-BANK-BODYFORCE": "ROBBERY",
"ROBBERY-BANK-GUN": "ROBBERY",
"ROBBERY-BANK-WEAPON": "ROBBERY",
"ROBBERY-BUSINESS-BODYFORCE": "ROBBERY",
"ROBBERY-BUSINESS-GUN": "ROBBERY",
"ROBBERY-BUSINESS-WEAPON": "ROBBERY",
"ROBBERY-RESIDENCE-BODYFORCE": "ROBBERY",
"ROBBERY-RESIDENCE-GUN": "ROBBERY",
"ROBBERY-RESIDENCE-WEAPON": "ROBBERY",
"ROBBERY-STREET-BODYFORCE": "ROBBERY",
"ROBBERY-STREET-GUN": "ROBBERY",
"ROBBERY-STREET-WEAPON": "ROBBERY",
"THEFT OF SERVICES": "LARCENY/THEFT",
"THEFT-AUTO PARTS": "LARCENY/THEFT",
"THEFT-AUTOACC": "LARCENY/THEFT",
"THEFT-BICYCLE": "LARCENY/THEFT",
"THEFT-BOAT": "LARCENY/THEFT",
"THEFT-BUILDING": "LARCENY/THEFT",
"THEFT-CARPROWL": "LARCENY/THEFT",
"THEFT-COINOP": "LARCENY/THEFT",
"THEFT-LICENSE PLATE": "LARCENY/THEFT",
"THEFT-MAIL": "LARCENY/THEFT",
"THEFT-PKPOCKET": "LARCENY/THEFT",
"THEFT-PRSNATCH": "LARCENY/THEFT",
"THEFT-SHOPLIFT": "LARCENY/THEFT",
"THREATS-KILL": "ASSAULT",
"THREATS-OTHER": "OTHER OFFENSES",
"THREATS-WEAPON": "ASSAULT",
"TRAFFIC": "OTHER OFFENSES",
"URINATING/DEFECATING-IN PUBLIC": "DISORDERLY CONDUCT",
"VEH-RCVD-FOR OTHER AGENCY": "VEHICLE THEFT",
"VEH-THEFT-AUTO": "VEHICLE THEFT",
"VEH-THEFT-MTRCYCLE": "VEHICLE THEFT",
"VEH-THEFT-OTHVEH": "VEHICLE THEFT",
"VEH-THEFT-TRAILER": "VEHICLE THEFT",
"VEH-THEFT-TRUCK": "VEHICLE THEFT",
"VIOL-COURT ORDER": "OTHER OFFENSES",
"WARRANT-FUGITIVE": "WARRANTS",
"WARRARR-FELONY": "WARRANTS",
"WARRARR-MISDEMEANOR": "WARRANTS",
"WEAPON-CONCEALED": "WEAPON LAWS",
"WEAPON-DISCHARGE": "WEAPON LAWS",
"WEAPON-POSSESSION": "WEAPON LAWS",
"WEAPON-SELLING": "WEAPON LAWS",
"WEAPON-SURRENDER-EXCLUDING FIR": "WEAPON LAWS",
"WEAPON-UNLAWFUL USE": "WEAPON LAWS",
"[INC - CASE DC USE ONLY]": "OTHER OFFENSES"
}
for category, subset in category_subset.iteritems():
seattle_normalized.loc[seattle_normalized["Category"] == category, "Category"] = subset
sf_categories = set(sf_normalized["Category"].tolist())
seattle_categories = set(seattle_normalized["Category"].tolist())
print "SF crime categories: %s" % ",".join(sorted(sf_categories))
print "Seattle crime categories: %s" % ",".join(sorted(seattle_categories))
assert seattle_categories <= sf_categories
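As a side note, the same remapping could arguably be written more concisely with pandas' Series.replace, which maps matching values and leaves everything else untouched (a sketch, not the approach used above):
# Sketch of an equivalent one-liner using Series.replace
# seattle_normalized["Category"] = seattle_normalized["Category"].replace(category_subset)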
Finally we have sf_normalized and seattle_normalized with a common schema and a common category namespace. At that point we have everything we need, and we just need to create a final DataFrame called dataset, which will be the concatenation of sf_normalized and seattle_normalized.
dataset = pd.concat([sf_normalized, seattle_normalized])
total_nrows = dataset.shape[0]
print "Total number of rows after union: %d" % total_nrows
assert total_nrows == sf_nrows + seattle_nrows
dataset.sample(n=5)
Now that we have our dataset DataFrame, we are ready to explore our data. As mentioned at the beginning, our goal is to look at the distribution of crimes at night, so as a first step we need to figure out which crimes are more frequent at night.
We restrict ourselves to San Francisco data for now, to avoid bloating our visualizations with too much data, and we will compare with Seattle at the end of this analysis.
To visualize data in an efficient manner, we will be using Python's matplotlib library which offers some very powerful capabilities. In order to show visualizations inline in this notebook, we need to use the following macro:
%matplotlib inline
Let's import all the matplotlib packages we will be using:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colorbar as cbar
For this temporal representation, the idea is that we want to show 3 variables:

- The crime Category.
- The TimeBin (not Time, because we wouldn't be able to run any aggregation on it).
- The number of crime occurrences for that category and time bin.

Initially we could think about doing this via a 3d-plot, but this visualization would not be very efficient, as we wouldn't evaluate distances very easily and this could be misleading.
Instead, a more efficient representation would be to have a grid where category and hour are on the axes, and each grid cell will be a color whose intensity will represent the proportion of crimes in that category and hour. This will make it easy to see for a given category, for which hours there are more occurrences.
As a first step, we need to take all of San Francisco's data in dataset and group it by TimeBin and Category. This will tie all the crimes to their respective TimeBin/Category combination.
# select data from SF and group by TimeBin (hour) and crime Category
gb = dataset[dataset["City"] == SF].groupby(["TimeBin", "Category"])
From there we want to create a matrix as a numpy.array where the x-axis is the hour, the y-axis is the category, and each cell value is the number of crimes for that category and hour.
import numpy as np
# sorted since it is more pleasing visually
categories = sorted(list(set(dataset.Category.tolist())))
# number of categories
n_categories = len(categories)
# matrix where the x-axis is the hour, and y-axis is the category
img_src = np.zeros((n_categories, 24))
# add the frequencies of crime per hour in the matrix
for group, values in gb:
timebin = group[0]
category = group[1]
value = values.shape[0]
img_src[categories.index(category)][timebin] = value
We have our matrix with the number of crimes per category/hour, but this is not enough to have a meaningful representation.
If we keep the values as simply the number of crimes, then categories with a high overall number of occurrences (like LARCENY/THEFT or ASSAULT) will overshadow all the other categories, so we would only see the high-occurrence crimes in our grid.
To solve it, we can normalize our matrix by row. If we divide each row element by the max value for that row, all row values will be between 0 and 1 and represent the temporal distribution of occurrences for that crime category. A number closer to 0 means that this particular hour saw a low number of crimes in that category, while a number close to 1 means that this particular hour saw a high number of crimes in that category.
# normalizing frequencies between 0 and 1 based on the max of each row
for row in img_src:
m = max(row)
for idx in range(len(row)):
row[idx] = row[idx] / m
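For reference, the same row-wise normalization can be expressed in a vectorized way with numpy (a sketch assuming every row has at least one non-zero value, which is the case here):
# Vectorized equivalent of the loop above (sketch, not executed)
# img_src = img_src / img_src.max(axis=1, keepdims=True)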
Now we are ready to plot our matrix in a grid. For that, we need to first create a grid layout. This can be done with the grid function in matplotlib. We create a small helper function that we can reuse later if needed.
def preparePlot(xticks, yticks, figsize=(10.5, 6), hideLabels=False, gridColor='#999999',
gridWidth=1.0):
plt.close()
fig, ax = plt.subplots(figsize=figsize, facecolor='white', edgecolor='white')
ax.axes.tick_params(labelcolor='#999999', labelsize='10')
for axis, ticks in [(ax.get_xaxis(), xticks), (ax.get_yaxis(), yticks)]:
axis.set_ticks_position('none')
axis.set_ticks(ticks)
axis.label.set_color('#999999')
if hideLabels: axis.set_ticklabels([])
plt.grid(color=gridColor, linewidth=gridWidth, linestyle='-')
map(lambda position: ax.spines[position].set_visible(False), ['bottom', 'top', 'left', 'right'])
return fig, ax
At that point we can use matplotlib's imshow method, which takes a matrix and plots it according to some parameters:

- The interpolation parameter allows us to interpolate each "pixel" of the image based on a specified algorithm. Here we use nearest so that each pixel fits into each cell of the grid.
- aspect controls the image aspect-ratio; we leave it to auto to let matplotlib handle it.
- cmap represents the colormap used to display the pixel intensity. We use cm.Greys, which is the black and white colormap, meaning a pixel close to 0 will produce a white color, while a pixel close to 1 will produce a black color.

# draw grid on axes
fig, ax = preparePlot(np.arange(.5, 23, 1), np.arange(.5, n_categories-1, 1), figsize=(16,14), hideLabels=True,
gridColor='#eeeeee', gridWidth=1.1)
# interpolate crime intensity per hour on the grid
image = plt.imshow(img_src, interpolation='nearest', aspect='auto', cmap=cm.Greys)
# x-axis labels
for x, y, s in zip(np.arange(-.125, 24, 1), np.repeat(-.75, 24), [str(x) for x in range(24)]):
plt.text(x, y, s, color='#999999', size='10')
# y-axis labels
for x, y, s in zip(np.repeat(-.75, n_categories), np.arange(.125, n_categories, 1), categories):
plt.text(x, y, s, color='#999999', size='10', horizontalalignment="right")
plt.title("Distribution of crimes per category depending on time of day in San Francisco", size=20, y=1.05)
plt.xlabel("Hour", color='#999999', size="20")
plt.ylabel("Crime Category", color='#999999', size="20")
ax.yaxis.set_label_position("right")
# plot the colorbar to show scale
cbar = fig.colorbar(image, ticks=[0, 0.5, 1], shrink=0.25, orientation='vertical')
cbar.ax.set_yticklabels(['Low', 'Medium', 'High'])
plt.show()
From that plot, a few interesting properties emerge. Since our goal here is to look at crimes happening at night, let's take a closer look at the crimes we can identify:
- ASSAULT crimes are constant overall throughout the day, but there is a distinct increase after 4pm and a peak around 10pm, and then it goes back to normal after 2am.
- DRUNKENNESS has a low number of occurrences during the day before 4pm, then steadily increases and peaks around 10pm before going back to normal around 3am.
- For LARCENY/THEFT crimes, the number of occurrences gradually, almost linearly, increases throughout the day, reaching critical mass around 5pm, peaking around 7pm, and then going down around midnight-1am.
- PROSTITUTION seems to be almost non-existent during the day, but there are some distinct occurrences at night, especially close to midnight.
- ROBBERY: the number of occurrences starts growing at 5-6pm, peaking around midnight, and falling flat around 2am. This makes sense: robberies are more likely to happen when shops are closed.
- VEHICLE THEFT seems to be mostly concentrated between 5pm and midnight. This makes sense intuitively: people are more likely to steal cars when it is dark and when the owners are at home after work.

Now we have identified which crimes are more common at night, and we want to focus on those and dig further.
In particular, it would be interesting to see how these crimes are distributed geographically, and see if we can create a map of San Francisco's criminality at night showing which neighborhoods to avoid. We have the Longitude and Latitude columns in our dataset DataFrame, which can help us, but we don't really have a map of San Francisco.
Plotting geospatial data is not a trivial topic; thankfully several libraries exist in Python that we use in this analysis:

- matplotlib's basemap toolkit: useful to plot 2D map data, but doesn't provide much facility around that.
- descartes: used to convert shapely polygons into matplotlib patches to be represented on maps.

Some interesting analyses that I drew from:
Let's import all of these modules right now to make sure all is working - some of these have complex and long installation procedures, so we should check they are working before starting:
from matplotlib.colors import Normalize
from matplotlib.collections import PatchCollection
from mpl_toolkits.basemap import Basemap
from descartes import PolygonPatch
from shapely.geometry import Point, Polygon, MultiPoint, MultiPolygon
from shapely.prepared import prep
from itertools import chain
import fiona
Before we can get started, we need an extra dataset aside from the San Francisco crime data.
In order to draw a map of San Francisco, the easiest way is to leverage fiona and use a shapefile of San Francisco. One such shapefile is available on SF's open-data portal and has been downloaded in this repository under data/SFFind_Neighborhoods. Let's read it with Fiona:
SF_NEIGHBORHOODS = "sffind_neighborhoods"
sf_neighborhoods = os.path.join(DATA_DIR, "%s.shp" % SF_NEIGHBORHOODS)
shp = fiona.open(sf_neighborhoods)
Now that we have successfully read the shapefile, we need to pre-compute some statistics which will be used to construct our map:
# Extract map boundaries
bds = shp.bounds
shp.close()
extra = 0.01
ll = (bds[0], bds[1])
ur = (bds[2], bds[3])
coords = list(chain(ll, ur))
# width, height of map
w, h = coords[2] - coords[0], coords[3] - coords[1]
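As a quick illustrative check (not strictly needed), the bounding box should span only a fraction of a degree in each direction, roughly the size of San Francisco:
# Illustrative: print the bounding box extent in degrees
print "Map bounding box: %.3f x %.3f degrees" % (w, h)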
Now we have enough data to create a matplotlib Basemap object, which is the foundation of our map. Here we need to specify some geospatial parameters, like which projection to use, as well as latitude and longitude coordinates for our map and where it should be centered.
On top of that, we overlay the content of our shapefile so we have the starting point for our Basemap.
m = Basemap(
projection='tmerc',
lon_0=-122.,
lat_0=37.7,
ellps = 'WGS84',
llcrnrlon=coords[0] - extra * w,
llcrnrlat=coords[1] - extra + 0.01 * h,
urcrnrlon=coords[2] + extra * w,
urcrnrlat=coords[3] + extra + 0.01 * h,
lat_ts=0,
resolution='i',
suppress_ticks=True)
m.readshapefile(
os.path.join(DATA_DIR, SF_NEIGHBORHOODS),
'SF',
color='none',
zorder=2)
print "Read SF neighborhood data into Basemap"
Next we need to create a DataFrame representing our map data, on which we can apply operations. This DataFrame will contain Polygon objects via shapely, where each Polygon represents a San Francisco neighborhood.
In this DataFrame we also want to include geographic information like the area size in square meters and square kilometers, because ultimately we will want to represent the crime density by neighborhood.
df_map = pd.DataFrame({
'poly': [Polygon(xy) for xy in m.SF],
'ward_name': [ward['name'] for ward in m.SF_info]})
df_map['area_m'] = df_map['poly'].map(lambda x: x.area)
df_map['area_km'] = df_map['area_m'] / 1000000.  # 1 km^2 = 1,000,000 m^2
# Draw neighborhoods with polygons
df_map['patches'] = df_map['poly'].map(lambda x: PolygonPatch(
x,
fc='#000000',
ec='#ffffff', lw=.5, alpha=1,
zorder=4))
df_map.sample(n=5)
Finally, we need some data to plot! We can use our dataset DataFrame and convert the Longitude and Latitude columns to shapely Point and MultiPoint objects, which will be the basis for drawing on our Basemap.
We need to make sure that we select the right data from our dataset DataFrame before transforming it to Points:

- Only crimes that happened in San Francisco.
- Only crimes that happened at night (TimeBin at or after 6pm, or at or before 6am).
- Only the night crime categories identified earlier (ASSAULT, DRUNKENNESS, LARCENY/THEFT, PROSTITUTION, ROBBERY, VEHICLE THEFT).

def makePoints(dat):
# Create Point objects in map coordinates from dataframe lon and lat values
map_points = pd.Series([Point(m(mapped_x,mapped_y)) for mapped_x, mapped_y in zip(dat['Longitude'],dat['Latitude'])])
plt_points = MultiPoint(list(map_points.values))
hoods_polygon = prep(MultiPolygon(list(df_map['poly'].values)))
pts = filter(hoods_polygon.contains,plt_points)
return pts
night_categories = [
"VEHICLE THEFT",
"ROBBERY",
"PROSTITUTION",
"LARCENY/THEFT",
"DRUNKENNESS",
"ASSAULT"
]
sf_night_crimes = dataset[
(dataset["City"] == SF) &
(
(dataset["TimeBin"] >= 18) |
(dataset["TimeBin"] <= 6)
) &
(dataset["Category"].isin(night_categories))]
sf_night_vehicle_theft = sf_night_crimes[sf_night_crimes["Category"] == night_categories[0]]
sf_night_robbery = sf_night_crimes[sf_night_crimes["Category"] == night_categories[1]]
sf_night_prostitution = sf_night_crimes[sf_night_crimes["Category"] == night_categories[2]]
sf_night_theft = sf_night_crimes[sf_night_crimes["Category"] == night_categories[3]]
sf_night_drunk = sf_night_crimes[sf_night_crimes["Category"] == night_categories[4]]
sf_night_assault = sf_night_crimes[sf_night_crimes["Category"] == night_categories[5]]
sf_night_crimes_points = makePoints(sf_night_crimes)
sf_night_vehicle_theft_points = makePoints(sf_night_vehicle_theft)
sf_night_robbery_points = makePoints(sf_night_robbery)
sf_night_prostitution_points = makePoints(sf_night_prostitution)
sf_night_theft_points = makePoints(sf_night_theft)
sf_night_drunk_points = makePoints(sf_night_drunk)
sf_night_assault_points = makePoints(sf_night_assault)
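The six explicit filters above keep the code easy to follow; an equivalent, more compact formulation (just a sketch, not used here) would build the per-category point lists in one pass:
# Sketch: build the per-category point lists with a dict comprehension
# night_points = {cat: makePoints(sf_night_crimes[sf_night_crimes["Category"] == cat])
#                 for cat in night_categories}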
Let's recap where we are at this point:
- A Basemap instance created, with our neighborhood shapefile overlaid.
- shapely Points created for each of our night crime categories.

All that is left is to scatter the individual crime Points on the map and render them on top of the Basemap to get a geospatial visualization of our crime occurrences at night.
Some considerations used here to produce a visually pleasing visualization:

- PROSTITUTION, DRUNKENNESS and ROBBERY, which are less frequent, have a bigger scatter plot size so they don't disappear under the far more frequent ASSAULT, LARCENY/THEFT and VEHICLE THEFT.
- LARCENY/THEFT has quite a bit of overlap with the other categories, so it is represented with a cross (x) instead of a dot (o) like the others, in order to be able to distinguish every category.

plt.clf()
fig = plt.figure()
ax = fig.add_subplot(111, axisbg='w', frame_on=False)
# Scatter vehicle theft occurrences
dev = m.scatter(
[geom.x for geom in sf_night_vehicle_theft_points],
[geom.y for geom in sf_night_vehicle_theft_points],
10, marker='o', lw=.25,
facecolor='cyan', edgecolor='cyan',
alpha=0.75, antialiased=True,
label='Vehicle Theft', zorder=3)
# Scatter robbery occurrences
dev = m.scatter(
[geom.x for geom in sf_night_robbery_points],
[geom.y for geom in sf_night_robbery_points],
10, marker='o', lw=5.5,
facecolor='red', edgecolor='red',
alpha=0.75, antialiased=True,
label='Robbery', zorder=3)
# Scatter prostitution occurrences
dev = m.scatter(
[geom.x for geom in sf_night_prostitution_points],
[geom.y for geom in sf_night_prostitution_points],
10, marker='o', lw=5.5,
facecolor='green', edgecolor='green',
alpha=0.75, antialiased=True,
label='Prostitution', zorder=3)
# Scatter larceny/theft occurrences
dev = m.scatter(
[geom.x for geom in sf_night_theft_points],
[geom.y for geom in sf_night_theft_points],
10, marker='x', lw=.75,
facecolor='magenta', edgecolor='magenta',
alpha=0.75, antialiased=True,
label='Larceny/Theft', zorder=3)
# Scatter drunkenness occurrences
dev = m.scatter(
[geom.x for geom in sf_night_drunk_points],
[geom.y for geom in sf_night_drunk_points],
10, marker='o', lw=5.5,
facecolor='yellow', edgecolor='yellow',
alpha=0.75, antialiased=True,
label='Drunkenness', zorder=3)
# Scatter assault occurrences
dev = m.scatter(
[geom.x for geom in sf_night_assault_points],
[geom.y for geom in sf_night_assault_points],
10, marker='o', lw=.25,
facecolor='orange', edgecolor='orange',
alpha=0.75, antialiased=True,
label='Assault', zorder=3)
ax.add_collection(PatchCollection(df_map['patches'].values, match_original=True))
m.drawmapscale(
coords[0] + 0.015, coords[1] - 0.005,
coords[0], coords[1],
2.,
barstyle='fancy', labelstyle='simple',
fillcolor1='w', fillcolor2='#555555',
fontcolor='#555555',
zorder=5)
plt.tight_layout()
plt.legend(loc="center right")
plt.title("Geographic distribution of most prevalent crimes at night in San Francisco during Summer 2014", size=15)
fig.set_size_inches(15,15)
plt.show()
So what does this map tell us? We can make plenty of observations now that we have a geospatial representation of crimes and their neighborhoods:
- ASSAULT crimes seem to be particularly prevalent in the Tenderloin neighborhood. This makes sense, since the Tenderloin has a reputation for being a pretty sketchy neighborhood.
- PROSTITUTION is not found so much in the city center, but seems to be more concentrated around the Mission district. This also makes sense, because Capp Street has a reputation of having many prostitutes, so the data correlates with this story.
- As for ROBBERY occurrences, they are pretty sparsely distributed across the city and there is no real epicenter. But what we can clearly see is that the further West we go, the fewer robberies there are.
- VEHICLE THEFT and LARCENY/THEFT happen pretty much uniformly everywhere in the city. The frequency is of course higher in the city center, but this is more due to it being more densely populated than anything else.
- As for DRUNKENNESS, you can find more drunk people along the main streets like Market Street. This makes sense because this is probably where most bars are located.

This was a useful visualization, but we're not done yet. One of our goals was to see which neighborhoods to avoid at night. And, while we can start forming a pretty good idea at this point, it would be nice to cluster the different neighborhoods based on their crime density, and represent that on the map.
To that end we will use a classification algorithm called Jenks Natural Breaks Optimization, which will take all our neighborhoods and their crime densities, and divide them into several crime-density classes.
The first step is to take our night crime data (for all 6 night crime categories), and identify to which neighborhood they belong, count those, and derive the crime density from it:
df_map['count'] = df_map['poly'].map(lambda x: int(len(filter(prep(x).contains, sf_night_crimes_points))))
df_map['density_m'] = df_map['count'] / df_map['area_m']
df_map['density_km'] = df_map['count'] / df_map['area_km']
# it's easier to work with NaN values when classifying
df_map.replace(to_replace={'density_m': {0: np.nan}, 'density_km': {0: np.nan}}, inplace=True)
df_map.sample(n=5)
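As an optional sanity check (illustrative), the per-neighborhood counts cannot exceed the total number of night-crime points, since points falling outside every neighborhood polygon are simply dropped:
# Illustrative check: neighborhood counts are bounded by the number of plotted points
assert df_map['count'].sum() <= len(sf_night_crimes_points)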
Next we use PySAL, which has an implementation of the Natural Breaks algorithm. We apply it to the crime density per square kilometer and ask the algorithm to divide the neighborhoods into 5 classes.
from pysal.esda.mapclassify import Natural_Breaks as nb
# Calculate Jenks natural breaks for density
breaks = nb(
df_map[df_map['density_km'].notnull()].density_km.values,
initial=300,
k=5)
prev = 0
for density, counts in zip(breaks.bins, breaks.counts):
print "%d neighborhoods have crime density D such as %d/km^2 <= D <= %d/km^2" % (counts, prev, density)
prev = density
From that, we join back the results of the Natural Breaks algorithm with our map DataFrame, so that we know for each neighborhood what its class is. This is all we need in terms of data; based on that we will be able to assign a different color to each neighborhood according to its class.
We also want to create the labels for the color scale based on the density values found by the algorithm, so we can show in the map to what density each color corresponds.
# the notnull method lets us match indices when joining
jb = pd.DataFrame({'jenks_bins': breaks.yb}, index=df_map[df_map['density_km'].notnull()].index)
df_map.drop("jenks_bins", inplace=True, axis=1, errors="ignore")
df_map = df_map.join(jb)
df_map.jenks_bins.fillna(-1, inplace=True)
# assign labels to the colorbar based on the natural breaks bins
jenks_labels = ["<= %0.1f crimes / km$^2$" % b for b in breaks.bins]
jenks_labels.insert(0, 'No crimes')
Finally all that's left to do is to draw the map and assign the right colors! We will use the Blues colormap, which is more aesthetically pleasing.
from matplotlib.colors import Normalize, LinearSegmentedColormap
plt.clf()
fig = plt.figure()
ax = fig.add_subplot(111, axisbg='w', frame_on=False)
# use a blue colour ramp - we'll be converting it to a map using cmap()
cmap = plt.get_cmap('Blues')
pc = PatchCollection(df_map['patches'], match_original=True)
# impose our colour map onto the patch collection
norm = Normalize()
pc.set_facecolor(cmap(norm(df_map['jenks_bins'].values)))
ax.add_collection(pc)
# Convenience functions for working with colour ramps and bars
def colorbar_index(ncolors, cmap, labels=None, **kwargs):
cmap = cmap_discretize(cmap, ncolors)
mappable = cm.ScalarMappable(cmap=cmap)
mappable.set_array([])
mappable.set_clim(-0.5, ncolors+0.5)
colorbar = plt.colorbar(mappable, **kwargs)
colorbar.set_ticks(np.linspace(0, ncolors, ncolors))
colorbar.set_ticklabels(range(ncolors))
if labels:
colorbar.set_ticklabels(labels)
return colorbar
def cmap_discretize(cmap, N):
    if type(cmap) == str:
        cmap = plt.get_cmap(cmap)
    colors_i = np.concatenate((np.linspace(0, 1., N), (0., 0., 0., 0.)))
    colors_rgba = cmap(colors_i)
    indices = np.linspace(0, 1., N + 1)
    cdict = {}
    for ki, key in enumerate(('red', 'green', 'blue')):
        cdict[key] = [(indices[i], colors_rgba[i - 1, ki], colors_rgba[i, ki]) for i in xrange(N + 1)]
    return LinearSegmentedColormap(cmap.name + "_%d" % N, cdict, 1024)
# Add a colour bar
cb = colorbar_index(ncolors=len(jenks_labels), cmap=cmap, shrink=0.5, labels=jenks_labels)
cb.ax.tick_params(labelsize=10)
# Show the neighborhoods with the highest crime densities, in descending order
highest = '\n'.join(
    row['ward_name'] for _, row in
    df_map[df_map['jenks_bins'] == 4].sort('density_km', ascending=False)[:10].iterrows())
highest = 'Most criminal neighborhoods:\n\n' + highest
# Subtraction is necessary for precise y coordinate alignment
details = cb.ax.text(
-1., 0 + 0.3,
highest,
ha='right', va='bottom',
size=10,
color='#555555')
# Draw a map scale
m.drawmapscale(
coords[0] + 0.015, coords[1] - 0.005,
coords[0], coords[1],
2.,
barstyle='fancy', labelstyle='simple',
fillcolor1='w', fillcolor2='#555555',
fontcolor='#555555',
zorder=5)
plt.tight_layout()
plt.title("Crime density at night by San Francisco neighborhood during Summer 2014", size=15)
fig.set_size_inches(15, 15)
plt.show()
This map is much easier to parse visually than the previous map. It contains less information, but information is aggregated and tells a clearer story with no need for interpretation.
Here we can see that most of the highest-crime neighborhoods are right in the city center. This can be explained by the fact that this area is more densely populated, and is also where most nighttime activity happens, so this was expected.
What is interesting to notice, though, is that even though they are both far from the city center, Western San Francisco (the Avenues and Twin Peaks) and the Southeast districts have very different crime patterns: there are far more crimes in the Southeast than in the West.
At this point we have a pretty good understanding of crime activity at night in San Francisco, both temporally and spatially. What would now be interesting to see is whether there is any correlation between what we've found in San Francisco and crime data in Seattle.
We already did the groundwork during pre-processing to normalize Seattle and San Francisco's datasets, so it is now a simple matter of visualizing what we previously did.
An interesting way to show the differences in crime distribution between the 2 cities is to use hierarchical clustering on normalized crime occurrences broken down by category. Thankfully, Python's seaborn library provides powerful visualization techniques and, among those, a clustermap, which essentially uses hierarchical clustering to discover relationships between rows and columns in a DataFrame.
The first step is to prepare our data: we want to create a DataFrame where each row represents a category and each column a different city. A few extra processing steps:

- Only keep crimes happening at night, for consistency with the rest of this analysis.
- Filter out crime categories which don't have occurrences in both cities.
- Normalize each row, so that very frequent categories like LARCENY/THEFT don't end up alone in a cluster with everything else in a separate cluster. The sum of each row should be 1.

# only keep night crimes
gb = dataset[(dataset["TimeBin"] >= 18) | (dataset["TimeBin"] <= 6)].groupby("Category")
category_by_city = gb["City"].value_counts().unstack().fillna(0)
# filter out crimes which don't have an occurrence in both cities
category_by_city_nonnull = category_by_city[(category_by_city[SF] > 0) & (category_by_city[SEATTLE] > 0)]
print "Crime categories not in both cities:"
print category_by_city[(category_by_city[SF] == 0) | (category_by_city[SEATTLE] == 0)]
# normalize the distribution of crimes so that each rows adds up to 1
category_by_city_norm = category_by_city_nonnull.div(category_by_city_nonnull.sum(axis=1), axis=0)
category_by_city_norm.sample(n=5)
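As an optional check (illustrative), every row of the normalized DataFrame should now sum to 1:
# Illustrative check: each category row sums to 1 after normalization
assert np.allclose(category_by_city_norm.sum(axis=1), 1.0)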
We are now ready to plot our clustermap, still using the Blues colormap for consistency with what we did previously:
import seaborn as sns
g = sns.clustermap(category_by_city_norm, annot=True, cmap="Blues", fmt="g")
plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
g.fig.suptitle("Normalized distribution and clustering of crimes between San Francisco and Seattle", size=15)
plt.show()
The clustermap has rearranged the order of the categories to exhibit the different clusters. We can see quite clearly that 3 different clusters emerge:
Let's look at our night categories again:
- ASSAULT: 0.54 (SF) ~= 0.46 (Seattle)
- DRUNKENNESS: not present in Seattle, probably due to the way we manually mapped the Seattle categories, or it could already be labelled by SeattlePD under different categories.
- LARCENY/THEFT: 0.43 (SF) ~= 0.57 (Seattle)
- PROSTITUTION: 0.31 (SF) < 0.69 (Seattle)
- ROBBERY: 0.28 (SF) < 0.72 (Seattle)
- VEHICLE THEFT: 0.38 (SF) < 0.62 (Seattle)

ASSAULT and LARCENY/THEFT are by far the most frequent crime categories (see the previous map to convince yourself), so it makes sense that their proportions are roughly balanced between the 2 cities: the large number of occurrences makes the comparison statistically meaningful.
For the other categories, however, the proportions are far lower in San Francisco than in Seattle. There are multiple ways to explain that:
In this analysis we discovered the following facts: