For this analysis, we have access to 2 months of real crime data for Summer 2014 in San Francisco and Seattle.
The goal of this study is to explore these cities' nightlife in a criminology context, to figure out which crimes are more frequent at night, which neighborhoods are most impacted, and how both cities compare.
Key findings: Throughout this analysis, we establish the following facts, which will be demonstrated via 4 different visualizations:
I chose IPython notebook to perform this analysis for several reasons:
YEAR = 2014
PERIOD = "summer"
TYPE = "incidents"
DATA_DIR = "data"
SF = "sanfrancisco"
SEATTLE = "seattle"
To do this analysis, we are using Python's excellent pandas library, which lets us represent the data as DataFrames and makes subsequent analysis much easier than doing it by hand.
Below we are creating 2 DataFrames:

- sf: contains San Francisco's crime data for Summer 2014.
- seattle: contains Seattle's crime data for Summer 2014.

# Load data into pandas DataFrames
import os
import pandas as pd
sf_filename = os.path.join(DATA_DIR, "%s_%s_%s_%d.csv" % (SF, TYPE, PERIOD, YEAR))
sf = pd.read_csv(sf_filename)
sf_nrows = sf.shape[0]
print "Read SF data from %s yielded %d rows" % (sf_filename, sf_nrows)
seattle_filename = os.path.join(DATA_DIR, "%s_%s_%s_%d.csv" % (SEATTLE, TYPE, PERIOD, YEAR))
seattle = pd.read_csv(seattle_filename)
seattle_nrows = seattle.shape[0]
print "Read Seattle data from %s yielded %d rows" % (seattle_filename, seattle_nrows)
assert sf_nrows == len(open(sf_filename).readlines()) - 1 != 0
assert seattle_nrows == len(open(seattle_filename).readlines()) - 1 != 0
As with any data analysis task, we start by sampling some random data in sf just to look at it and get a feeling for what we can get out of it.
For example, we can see that some columns are not really useful for what we are trying to accomplish, like IncidntNum or PdId, which are just SFPD internal IDs. On the other hand, some data like X or Y will be useful for any kind of geospatial analysis, as will Date and Time for temporal analysis.
sf.sample(n=5)
We do the same for seattle, and there are 2 important things to notice here:

- Like in sf, several columns are not useful and can be discarded.
- Some columns contain the same data as in sf but are called differently, like Longitude and Latitude which correspond respectively to X and Y in sf. The schema therefore does not match sf, and in this state making comparisons between the 2 cities will be hard.

seattle.sample(n=5)
At that point it becomes clear that, before we can do any kind of meaningful data exploration and analysis, we need to pre-process the data. In particular, the following tasks will be required:
There are several columns in sf that have data mapping to other columns in seattle, but they are called differently. Specifically:

- X in sf has a 1:1 relationship to Longitude in seattle.
- Y in sf has a 1:1 relationship to Latitude in seattle.
- PdDistrict in sf has a 1:1 relationship to District/Sector in seattle. To simplify we want to name both District.
- Descript in sf has a 1:1 relationship to Summarized Offense Description in seattle. To simplify we want to name both Description.

So in a first step, we will modify sf to rename these columns in place.
sf.rename(columns = {"X": "Longitude", "Y": "Latitude", "PdDistrict": "District", "Descript": "Description"},
inplace=True)
Next we want to take the existing columns, process them, and extract new features which may or may not be useful during our data exploration process later on. In particular:
- Since we want the data from sf and seattle normalized so they can fit in a single unified DataFrame, we need to indicate for each row the City, so that we can later on control which city we're processing.
- We have access to the Date field in format mm/dd/yyyy, but if we're doing temporal analysis it would be useful to have access to individual units separately, so we'd want one column for the DayOfMonth, another for the Month, and a last one for the Year.
.sf["City"] = pd.Series([SF]*sf.shape[0], index=sf.index)
sf["DayOfMonth"] = sf.Date.map(lambda x: int(x.split("/")[1]))
sf["Month"] = sf.Date.map(lambda x: int(x.split("/")[0]))
sf["Year"] = sf.Date.map(lambda x: int(x.split("/")[2]))
Another feature we want to extract is the TimeBin. We currently have access to the time of each incident under Time in format HH:MM, but if we want to perform time-based aggregation this is not very convenient.
Instead, we just keep the hour of the incident and store it in TimeBin. That way we have an integer representation of the time which is easier to visualize and work with, and we can aggregate hourly data very easily.
# Function that takes a Time string as "HH:MM" and returns an int representing the hour
def time_bin(t):
    hour, _minutes = t.split(":")
    return int(hour)
sf["TimeBin"] = sf.Time.map(time_bin)
Now that we have all our new features, we just need to drop the columns that are not useful for us. We will drop the following columns:

- IncidntNum: An SFPD internal ID which we can't make use of.
- PdId: Another SFPD internal ID.
- Location: This is just a combination of columns X and Y and is redundant.
- Resolution: While this tells an interesting story (how the case concluded), there is no such data in seattle, so we will not use it.

sf.drop(["IncidntNum", "PdId", "Location", "Resolution"], axis=1, inplace=True, errors="ignore")
We now have everything we need in sf; we simply need to create a new DataFrame sf_normalized with a specific column order. This is because seattle has a completely different column order, so we want to make sure both DataFrames have the exact same schema, so that we can ultimately fit them into a single DataFrame.
# Reindex the DataFrame
columns_index = ["City", "DayOfMonth", "DayOfWeek", "Month", "Year", "Time", "TimeBin",
"Longitude", "Latitude", "Address", "District",
"Category", "Description"
]
sf_normalized = sf[columns_index]
sf_normalized.sample(n=5)
We need to go through the same process for seattle. As a first step, let's drop the following columns:

- RMS CDW ID: A SeattlePD internal ID, of no use for us.
- General Offense Number: A unique number for each incident. Seems to be a function of the date, but we don't need it.
- Offense Code: A unique number identifying each type of offense. This maps to Offense Type but is less expressive, so we get rid of it.
- Offense Code Extension: Some ID, probably SeattlePD internal stuff.
- Census Tract 2000: Census information which is not relevant for our analysis.
- Location: This is just a combination of columns Longitude and Latitude and is redundant.

seattle_drop_columns = ["RMS CDW ID",
"General Offense Number",
"Offense Code",
"Offense Code Extension",
"Summary Offense Code",
"Census Tract 2000",
"Location"
]
seattle.drop(seattle_drop_columns, axis=1, inplace=True, errors="ignore")
Next we rename the columns which have a different name in sf:

- Hundred Block Location in seattle is similar to Address in sf.
- Offense Type in seattle is similar to Category in sf.
- Summarized Offense Description is renamed to Description to make it simpler, as mentioned earlier.
- District/Sector is renamed to District to make it simpler, as mentioned earlier.

seattle.rename(columns = {"Hundred Block Location": "Address", "Offense Type": "Category",
"Summarized Offense Description": "Description",
"District/Sector": "District"},
inplace=True)
Finally we want to create new features for our analysis, similarly to what we did for sf:

- City should always be seattle.
- For temporal data, we can choose between the alleged occurrence date (Occurred Date or Date Range Start) and the reported date (Date Reported). By looking at the data we see the reported date is sometimes many months after the alleged occurrence date, so we use the start of the alleged occurrence range for any temporal measure. From that we will create the new features DayOfMonth, DayOfWeek, Time and TimeBin to match the sf schema.

import datetime
seattle["City"] = pd.Series([SEATTLE]*seattle.shape[0], index=seattle.index)
seattle["DayOfMonth"] = seattle["Occurred Date or Date Range Start"].map(lambda x: int(x.split(" ")[0].split("/")[1]))
seattle["DayOfWeek"] = seattle["Occurred Date or Date Range Start"] \
.map(lambda x: datetime.datetime.strptime(x.split(" ")[0], "%m/%d/%Y").strftime("%A"))
seattle["Time"] = seattle["Occurred Date or Date Range Start"] \
.map(lambda x: datetime.datetime.strptime(x[11:], "%I:%M:%S %p").strftime("%H:%M"))
seattle["TimeBin"] = seattle.Time.map(time_bin)
Now we just need to reindex our DataFrame under a new one called seattle_normalized.
# Reindex the DataFrame
seattle_normalized = seattle[columns_index]
seattle_normalized.sample(n=5)
At that point, we have 2 DataFrames, sf_normalized and seattle_normalized, with matching schemas, but we're not done yet.
A particularly important column for analyzing the distribution of crime data is the Category. Unfortunately for us, the Seattle and San Francisco police departments have very different ways of labelling incidents, which don't match at all. For example, offenses related to drug trafficking are labelled DRUG/NARCOTIC in San Francisco, but in Seattle this is much more granular and there are more than 20 different labels for drug trafficking offenses. The problem is that if the granularity differs this much between the 2 cities, aggregated comparisons of crime categories won't mean anything.
As a result, we need to manually go through the crime categories and see how they map between the 2 cities. Since San Francisco categories are fewer and cover more ground, we will use San Francisco categories as the baseline, and try to map all of Seattle's crime categories into San Francisco categories namespace.
For example, we can say that Seattle's ASSLT-AGG-WEAPON matches San Francisco's ASSAULT category.
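To get a feel for this granularity gap, here is a quick illustrative count (an extra check, not required for the analysis) of how many distinct Seattle labels start with the NARC- prefix and will all collapse into SF's single DRUG/NARCOTIC category:
# Illustrative: count Seattle's drug-related labels before remapping
narc_labels = [c for c in set(seattle_normalized["Category"].tolist()) if c.startswith("NARC-")]
print "Seattle has %d distinct NARC-* labels that will map to DRUG/NARCOTIC" % len(narc_labels)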
category_subset = {
"RECKLESS BURNING": "ARSON",
"THEFT-OTH": "LARCENY/THEFT",
"BURGLARY-FORCE-NONRES": "BURGLARY",
"INJURY - ACCIDENTAL": "OTHER OFFENSES",
"ANIMAL-BITE": "OTHER OFFENSES",
"ANIMAL-CRUELTY": "OTHER OFFENSES",
"ANIMAL-OTH": "OTHER OFFENSES",
"ASSLT-AGG-BODYFORCE": "ASSAULT",
"ASSLT-AGG-GUN": "ASSAULT",
"ASSLT-AGG-POLICE-BODYFORCE": "ASSAULT",
"ASSLT-AGG-POLICE-GUN": "ASSAULT",
"ASSLT-AGG-POLICE-WEAPON": "ASSAULT",
"ASSLT-AGG-WEAPON": "ASSAULT",
"ASSLT-NONAGG": "ASSAULT",
"ASSLT-NONAGG-POLICE": "ASSAULT",
"BIAS INCIDENT": "NON-CRIMINAL",
"BURGLARY-FORCE-RES": "BURGLARY",
"BURGLARY-NOFORCE-NONRES": "BURGLARY",
"BURGLARY-NOFORCE-RES": "BURGLARY",
"BURGLARY-SECURE PARKING-NONRES": "BURGLARY",
"BURGLARY-SECURE PARKING-RES": "BURGLARY",
"COUNTERFEIT": "FORGERY/COUNTERFEITING",
"DISPUTE-CIVIL PROPERTY (AUTO)": "NON-CRIMINAL",
"DISPUTE-CIVIL PROPERTY (NON AU": "NON-CRIMINAL",
"DISPUTE-OTH": "NON-CRIMINAL",
"DISTURBANCE-NOISE": "NON-CRIMINAL",
"DISTURBANCE-OTH": "NON-CRIMINAL",
"DRIVE-BY": "OTHER OFFENSES",
"DUI-DRUGS": "DRIVING UNDER THE INFLUENCE",
"DUI-LIQUOR": "DRIVING UNDER THE INFLUENCE",
"ELUDING-FELONY FLIGHT": "RUNAWAY",
"EMBEZZLE": "EMBEZZLEMENT",
"ENDANGERMENT": "OTHER OFFENSES",
"ESCAPE": "RUNAWAY",
"FALSE REPORT": "OTHER OFFENSES",
"FIREWORK-POSSESS": "OTHER OFFENSES",
"FIREWORK-USE": "OTHER OFFENSES",
"FORGERY-CHECK": "FORGERY/COUNTERFEITING",
"FORGERY-CREDIT CARD": "FORGERY/COUNTERFEITING",
"FORGERY-OTH": "FORGERY/COUNTERFEITING",
"FRAUD-CHECK": "FRAUD",
"FRAUD-COMPUTER": "FRAUD",
"FRAUD-CREDIT CARD": "FRAUD",
"FRAUD-IDENTITY THEFT": "FRAUD",
"FRAUD-OTHER": "FRAUD",
"FRAUD-WIRE-ELECTRONIC": "FRAUD",
"HARASSMENT": "DISORDERLY CONDUCT",
"HOMICIDE-JUST-GUN": "ASSAULT",
"HOMICIDE-JUST-WEAPON": "ASSAULT",
"HOMICIDE-PREMEDITATED-GUN": "ASSAULT",
"ILLEGAL DUMPING": "NON-CRIMINAL",
"INJURY - OTHER": "OTHER OFFENSES",
"LIQUOR LAW VIOLATION": "LIQUOR LAWS",
"MALICIOUS HARASSMENT": "DISORDERLY CONDUCT",
"NARC-DRUG TRAFFIC LOITERING": "DRUG/NARCOTIC",
"NARC-EQUIPMENT/PARAPHENALIA": "DRUG/NARCOTIC",
"NARC-FORGERY-PRESCRIPTION": "DRUG/NARCOTIC",
"NARC-FOUND-AMPHETAMINE": "DRUG/NARCOTIC",
"NARC-FOUND-COCAINE": "DRUG/NARCOTIC",
"NARC-FOUND-HEROIN": "DRUG/NARCOTIC",
"NARC-FOUND-MARIJU": "DRUG/NARCOTIC",
"NARC-FOUND-METH": "DRUG/NARCOTIC",
"NARC-FOUND-OPIUM": "DRUG/NARCOTIC",
"NARC-FOUND-OTHER": "DRUG/NARCOTIC",
"NARC-FOUND-SYNTHETIC": "DRUG/NARCOTIC",
"NARC-FRAUD-PRESCRIPTION": "DRUG/NARCOTIC",
"NARC-POSSESS-AMPHETAMINE": "DRUG/NARCOTIC",
"NARC-POSSESS-COCAINE": "DRUG/NARCOTIC",
"NARC-POSSESS-HALLUCINOGEN": "DRUG/NARCOTIC",
"NARC-POSSESS-HEROIN": "DRUG/NARCOTIC",
"NARC-POSSESS-MARIJU": "DRUG/NARCOTIC",
"NARC-POSSESS-METH": "DRUG/NARCOTIC",
"NARC-POSSESS-OTHER": "DRUG/NARCOTIC",
"NARC-POSSESS-PILL/TABLET": "DRUG/NARCOTIC",
"NARC-PRODUCE-MARIJU": "DRUG/NARCOTIC",
"NARC-SELL-AMPHETAMINE": "DRUG/NARCOTIC",
"NARC-SELL-COCAINE": "DRUG/NARCOTIC",
"NARC-SELL-HEROIN": "DRUG/NARCOTIC",
"NARC-SELL-MARIJU": "DRUG/NARCOTIC",
"NARC-SELL-METH": "DRUG/NARCOTIC",
"NARC-SELL-SYNTHETIC": "DRUG/NARCOTIC",
"NARC-SMUGGLE-OTHER": "DRUG/NARCOTIC",
"OBSTRUCT": "OTHER OFFENSES",
"PORNOGRAPHY-OBSCENE MATERIAL": "PORNOGRAPHY/OBSCENE MAT",
"PROP RECOVERED-OTHER AGENCY": "STOLEN PROPERTY",
"PROPERTY DAMAGE - GRAFFITI": "VANDALISM",
"PROPERTY DAMAGE-NON RESIDENTIA": "VANDALISM",
"PROPERTY DAMAGE-RESIDENTIAL": "VANDALISM",
"PROPERTY FOUND": "STOLEN PROPERTY",
"PROPERTY LOST": "STOLEN PROPERTY",
"PROPERTY LOST - POLICE EQUIPME": "STOLEN PROPERTY",
"PROPERTY STOLEN-POSSESS": "STOLEN PROPERTY",
"PROPERTY STOLEN-SELL": "STOLEN PROPERTY",
"PROPERTY STOLEN-TRAFFICKING": "STOLEN PROPERTY",
"PROSTITUTION LOITERING": "PROSTITUTION",
"PROSTITUTION PATRONIZING": "PROSTITUTION",
"PROSTITUTION-ASSIST-PROMOTE": "PROSTITUTION",
"ROBBERY-BANK-BODYFORCE": "ROBBERY",
"ROBBERY-BANK-GUN": "ROBBERY",
"ROBBERY-BANK-WEAPON": "ROBBERY",
"ROBBERY-BUSINESS-BODYFORCE": "ROBBERY",
"ROBBERY-BUSINESS-GUN": "ROBBERY",
"ROBBERY-BUSINESS-WEAPON": "ROBBERY",
"ROBBERY-RESIDENCE-BODYFORCE": "ROBBERY",
"ROBBERY-RESIDENCE-GUN": "ROBBERY",
"ROBBERY-RESIDENCE-WEAPON": "ROBBERY",
"ROBBERY-STREET-BODYFORCE": "ROBBERY",
"ROBBERY-STREET-GUN": "ROBBERY",
"ROBBERY-STREET-WEAPON": "ROBBERY",
"THEFT OF SERVICES": "LARCENY/THEFT",
"THEFT-AUTO PARTS": "LARCENY/THEFT",
"THEFT-AUTOACC": "LARCENY/THEFT",
"THEFT-BICYCLE": "LARCENY/THEFT",
"THEFT-BOAT": "LARCENY/THEFT",
"THEFT-BUILDING": "LARCENY/THEFT",
"THEFT-CARPROWL": "LARCENY/THEFT",
"THEFT-COINOP": "LARCENY/THEFT",
"THEFT-LICENSE PLATE": "LARCENY/THEFT",
"THEFT-MAIL": "LARCENY/THEFT",
"THEFT-PKPOCKET": "LARCENY/THEFT",
"THEFT-PRSNATCH": "LARCENY/THEFT",
"THEFT-SHOPLIFT": "LARCENY/THEFT",
"THREATS-KILL": "ASSAULT",
"THREATS-OTHER": "OTHER OFFENSES",
"THREATS-WEAPON": "ASSAULT",
"TRAFFIC": "OTHER OFFENSES",
"URINATING/DEFECATING-IN PUBLIC": "DISORDERLY CONDUCT",
"VEH-RCVD-FOR OTHER AGENCY": "VEHICLE THEFT",
"VEH-THEFT-AUTO": "VEHICLE THEFT",
"VEH-THEFT-MTRCYCLE": "VEHICLE THEFT",
"VEH-THEFT-OTHVEH": "VEHICLE THEFT",
"VEH-THEFT-TRAILER": "VEHICLE THEFT",
"VEH-THEFT-TRUCK": "VEHICLE THEFT",
"VIOL-COURT ORDER": "OTHER OFFENSES",
"WARRANT-FUGITIVE": "WARRANTS",
"WARRARR-FELONY": "WARRANTS",
"WARRARR-MISDEMEANOR": "WARRANTS",
"WEAPON-CONCEALED": "WEAPON LAWS",
"WEAPON-DISCHARGE": "WEAPON LAWS",
"WEAPON-POSSESSION": "WEAPON LAWS",
"WEAPON-SELLING": "WEAPON LAWS",
"WEAPON-SURRENDER-EXCLUDING FIR": "WEAPON LAWS",
"WEAPON-UNLAWFUL USE": "WEAPON LAWS",
"[INC - CASE DC USE ONLY]": "OTHER OFFENSES"
}
for category, subset in category_subset.iteritems():
seattle_normalized.loc[seattle_normalized["Category"] == category, "Category"] = subset
sf_categories = set(sf_normalized["Category"].tolist())
seattle_categories = set(seattle_normalized["Category"].tolist())
print "SF crime categories: %s" % ",".join(sorted(sf_categories))
print "Seattle crime categories: %s" % ",".join(sorted(seattle_categories))
assert seattle_categories <= sf_categories
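As a side note, the same remapping could arguably be written more concisely with pandas' Series.replace, which maps matching values and leaves everything else untouched (a sketch, not the approach used above):
# Sketch of an equivalent one-liner using Series.replace
# seattle_normalized["Category"] = seattle_normalized["Category"].replace(category_subset)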
Finally we have sf_normalized and seattle_normalized with a common schema and a common category namespace. At that point we have everything we need, and we just need to create a final DataFrame called dataset, which will be the concatenation of sf_normalized and seattle_normalized.
dataset = pd.concat([sf_normalized, seattle_normalized])
total_nrows = dataset.shape[0]
print "Total number of rows after union: %d" % total_nrows
assert total_nrows == sf_nrows + seattle_nrows
dataset.sample(n=5)
Now that we have our dataset DataFrame, we are ready to explore our data. As mentioned at the beginning, our goal is to look at the distribution of crimes at night, so as a first step we need to figure out which crimes are more frequent at night.
We restrict ourselves to San Francisco data for now, to avoid bloating our visualizations with too much data, and we will compare with Seattle at the end of this analysis.
To visualize data in an efficient manner, we will be using Python's matplotlib library which offers some very powerful capabilities. In order to show visualizations inline in this notebook, we need to use the following macro:
%matplotlib inline
Let's import all the matplotlib packages we will be using:
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colorbar as cbar
For this temporal representation, the idea is that we want to show 3 variables:

- The crime Category.
- The TimeBin (not Time, because we wouldn't be able to run any aggregation on it).
- The number of crime occurrences for that category and time bin.

Initially we could think about doing this via a 3d-plot, but this visualization would not be very efficient, as we wouldn't evaluate distances very easily and this could be misleading.
Instead, a more efficient representation would be to have a grid where category and hour are on the axes, and each grid cell will be a color whose intensity will represent the proportion of crimes in that category and hour. This will make it easy to see for a given category, for which hours there are more occurrences.
As a first step, we need to take all of San Francisco's data in dataset and group it by TimeBin and Category. This will tie all the crimes to their respective TimeBin/Category combination.
# select data from SF and group by TimeBin (hour) and crime Category
gb = dataset[dataset["City"] == SF].groupby(["TimeBin", "Category"])
From there we want to create a matrix as a numpy.array where the x-axis is the hour, the y-axis is the category, and each cell value is the number of crimes for that category and hour.
import numpy as np
# sorted since it is more pleasing visually
categories = sorted(list(set(dataset.Category.tolist())))
# number of categories
n_categories = len(categories)
# matrix where the x-axis is the hour, and y-axis is the category
img_src = np.zeros((n_categories, 24))
# add the frequencies of crime per hour in the matrix
for group, values in gb:
timebin = group[0]
category = group[1]
value = values.shape[0]
img_src[categories.index(category)][timebin] = value
We have our matrix with the number of crimes per category/hour, but this is not enough to have a meaningful representation.
If we keep the values as simply the number of crimes, then categories with a high overall number of occurrences (like LARCENY/THEFT or ASSAULT) will overshadow all the other categories, so we would only see the high-occurrence crimes in our grid.
To solve it, we can normalize our matrix by row. If we divide each row element by the max value for that row, all row values will be between 0 and 1 and represent the temporal distribution of occurrences for that crime category. A number closer to 0 means that this particular hour saw a low number of crimes in that category, while a number close to 1 means that this particular hour saw a high number of crimes in that category.
# normalizing frequencies between 0 and 1 based on the max of each row
for row in img_src:
m = max(row)
for idx in range(len(row)):
row[idx] = row[idx] / m
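For reference, the same row-wise normalization can be expressed in a vectorized way with numpy (a sketch assuming every row has at least one non-zero value, which is the case here):
# Vectorized equivalent of the loop above (sketch, not executed)
# img_src = img_src / img_src.max(axis=1, keepdims=True)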
Now we are ready to plot our matrix in a grid. For that, we need to first create a grid layout. This can be done with the grid function in matplotlib. We create a small helper function that we can reuse later if needed.
def preparePlot(xticks, yticks, figsize=(10.5, 6), hideLabels=False, gridColor='#999999',
gridWidth=1.0):
plt.close()
fig, ax = plt.subplots(figsize=figsize, facecolor='white', edgecolor='white')
ax.axes.tick_params(labelcolor='#999999', labelsize='10')
for axis, ticks in [(ax.get_xaxis(), xticks), (ax.get_yaxis(), yticks)]:
axis.set_ticks_position('none')
axis.set_ticks(ticks)
axis.label.set_color('#999999')
if hideLabels: axis.set_ticklabels([])
plt.grid(color=gridColor, linewidth=gridWidth, linestyle='-')
map(lambda position: ax.spines[position].set_visible(False), ['bottom', 'top', 'left', 'right'])
return fig, ax
At that point we can use matplotlib's imshow method, which takes a matrix and plots it according to some parameters:

- The interpolation parameter allows us to interpolate each "pixel" of the image based on a specified algorithm. Here we use nearest so that each pixel fits into each cell of the grid.
- aspect controls the image aspect-ratio; we leave it to auto to let matplotlib handle it.
- cmap represents the colormap used to display the pixel intensity. We use cm.Greys, which is the black and white colormap, meaning a pixel close to 0 will produce a white color, while a pixel close to 1 will produce a black color.

# draw grid on axes
fig, ax = preparePlot(np.arange(.5, 23, 1), np.arange(.5, n_categories-1, 1), figsize=(16,14), hideLabels=True,
gridColor='#eeeeee', gridWidth=1.1)
# interpolate crime intensity per hour on the grid
image = plt.imshow(img_src, interpolation='nearest', aspect='auto', cmap=cm.Greys)
# x-axis labels
for x, y, s in zip(np.arange(-.125, 24, 1), np.repeat(-.75, 24), [str(x) for x in range(24)]):
plt.text(x, y, s, color='#999999', size='10')
# y-axis labels
for x, y, s in zip(np.repeat(-.75, n_categories), np.arange(.125, n_categories, 1), categories):
plt.text(x, y, s, color='#999999', size='10', horizontalalignment="right")
plt.title("Distribution of crimes per category depending on time of day in San Francisco", size=20, y=1.05)
plt.xlabel("Hour", color='#999999', size="20")
plt.ylabel("Crime Category", color='#999999', size="20")
ax.yaxis.set_label_position("right")
# plot the colorbar to show scale
cbar = fig.colorbar(image, ticks=[0, 0.5, 1], shrink=0.25, orientation='vertical')
cbar.ax.set_yticklabels(['Low', 'Medium', 'High'])
plt.show()
From that plot, a few interesting properties emerge. Since our goal here is to look at crimes happening at night, let's take a closer look at the crimes we can identify:
- ASSAULT crimes are constant overall throughout the day, but there is a distinct increase after 4pm and a peak around 10pm, and then it goes back to normal after 2am.
- DRUNKENNESS has a low number of occurrences during the day before 4pm, then steadily increases and peaks around 10pm before going back to normal around 3am.
- For LARCENY/THEFT crimes, the number of occurrences gradually, almost linearly, increases throughout the day, reaching critical mass around 5pm, peaking around 7pm, and then going down around midnight-1am.
- PROSTITUTION seems to be almost non-existent during the day, but there are some distinct occurrences at night, especially close to midnight.
- ROBBERY: the number of occurrences starts growing at 5-6pm, peaking around midnight, and falling flat around 2am. This makes sense: robberies are more likely to happen when shops are closed.
- VEHICLE THEFT seems to be mostly concentrated between 5pm and midnight. This makes sense intuitively: people are more likely to steal cars when it is dark and when the owners are at home after work.

Now we have identified which crimes are more common at night, and we want to focus on those and dig further.
In particular, it would be interesting to see how these crimes are distributed geographically, and see if we can create a map of San Francisco's criminality at night showing which neighborhoods to avoid. We have the Longitude and Latitude columns in our dataset DataFrame, which can help us, but we don't really have a map of San Francisco.
Plotting geospatial data is not a trivial topic; thankfully several libraries exist in Python that we use in this analysis:

- matplotlib's basemap toolkit: useful to plot 2D map data, but doesn't provide much facility around that.
- descartes: used to convert shapely polygons into matplotlib patches to be represented on maps.

Some interesting analyses that I drew from:
Let's import all of these modules right now to make sure all is working - some of these have complex and long installation procedures, so we should check they are working before starting:
from matplotlib.colors import Normalize
from matplotlib.collections import PatchCollection
from mpl_toolkits.basemap import Basemap
from descartes import PolygonPatch
from shapely.geometry import Point, Polygon, MultiPoint, MultiPolygon
from shapely.prepared import prep
from itertools import chain
import fiona
Before we can get started, we need an extra dataset aside from the San Francisco crime data.
In order to draw a map of San Francisco, the easiest way is to leverage fiona and use a shapefile of San Francisco. One such shapefile is available on SF's open-data portal and has been downloaded in this repository under data/SFFind_Neighborhoods. Let's read it with Fiona:
SF_NEIGHBORHOODS = "sffind_neighborhoods"
sf_neighborhoods = os.path.join(DATA_DIR, "%s.shp" % SF_NEIGHBORHOODS)
shp = fiona.open(sf_neighborhoods)
Now that we have successfully read the shapefile, we need to pre-compute some statistics which will be used to construct our map:
# Extract map boundaries
bds = shp.bounds
shp.close()
extra = 0.01
ll = (bds[0], bds[1])
ur = (bds[2], bds[3])
coords = list(chain(ll, ur))
# width, height of map
w, h = coords[2] - coords[0], coords[3] - coords[1]
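As a quick illustrative check (not strictly needed), the bounding box should span only a fraction of a degree in each direction, roughly the size of San Francisco:
# Illustrative: print the bounding box extent in degrees
print "Map bounding box: %.3f x %.3f degrees" % (w, h)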
Now we have enough data to create a matplotlib Basemap object, which is the foundation of our map. Here we need to specify some geospatial parameters, like which projection to use, as well as latitude and longitude coordinates for our map and where it should be centered.
On top of that, we overlay the content of our shapefile so we have the starting point for our Basemap.
m = Basemap(
projection='tmerc',
lon_0=-122.,
lat_0=37.7,
ellps = 'WGS84',
llcrnrlon=coords[0] - extra * w,
llcrnrlat=coords[1] - extra + 0.01 * h,
urcrnrlon=coords[2] + extra * w,
urcrnrlat=coords[3] + extra + 0.01 * h,
lat_ts=0,
resolution='i',
suppress_ticks=True)
m.readshapefile(
os.path.join(DATA_DIR, SF_NEIGHBORHOODS),
'SF',
color='none',
zorder=2)
print "Read SF neighborhood data into Basemap"
Next we need to create a DataFrame representing our map data, on which we can apply operations. This DataFrame will contain Polygon objects via shapely, where each Polygon represents a San Francisco neighborhood.
In this DataFrame we also want to include geographic information like the area size in square meters and square kilometers, because ultimately we will want to represent the crime density by neighborhood.
df_map = pd.DataFrame({
'poly': [Polygon(xy) for xy in m.SF],
'ward_name': [ward['name'] for ward in m.SF_info]})
df_map['area_m'] = df_map['poly'].map(lambda x: x.area)
df_map['area_km'] = df_map['area_m'] / 1000000.  # 1 km^2 = 1,000,000 m^2
# Draw neighborhoods with polygons
df_map['patches'] = df_map['poly'].map(lambda x: PolygonPatch(
x,
fc='#000000',
ec='#ffffff', lw=.5, alpha=1,
zorder=4))
df_map.sample(n=5)
Finally, we need some data to plot! We can use our dataset DataFrame and convert the Longitude and Latitude columns to shapely Point and MultiPoint objects, which will be the basis for drawing on our Basemap.
We need to make sure that we select the right data from our dataset DataFrame before transforming it to Points:

- Only crimes that happened in San Francisco.
- Only crimes that happened at night (TimeBin at or after 6pm, or at or before 6am).
- Only the night crime categories identified earlier (ASSAULT, DRUNKENNESS, LARCENY/THEFT, PROSTITUTION, ROBBERY, VEHICLE THEFT).

def makePoints(dat):
# Create Point objects in map coordinates from dataframe lon and lat values
map_points = pd.Series([Point(m(mapped_x,mapped_y)) for mapped_x, mapped_y in zip(dat['Longitude'],dat['Latitude'])])
plt_points = MultiPoint(list(map_points.values))
hoods_polygon = prep(MultiPolygon(list(df_map['poly'].values)))
pts = filter(hoods_polygon.contains,plt_points)
return pts
night_categories = [
"VEHICLE THEFT",
"ROBBERY",
"PROSTITUTION",
"LARCENY/THEFT",
"DRUNKENNESS",
"ASSAULT"
]
sf_night_crimes = dataset[
(dataset["City"] == SF) &
(
(dataset["TimeBin"] >= 18) |
(dataset["TimeBin"] <= 6)
) &
(dataset["Category"].isin(night_categories))]
sf_night_vehicle_theft = sf_night_crimes[sf_night_crimes["Category"] == night_categories[0]]
sf_night_robbery = sf_night_crimes[sf_night_crimes["Category"] == night_categories[1]]
sf_night_prostitution = sf_night_crimes[sf_night_crimes["Category"] == night_categories[2]]
sf_night_theft = sf_night_crimes[sf_night_crimes["Category"] == night_categories[3]]
sf_night_drunk = sf_night_crimes[sf_night_crimes["Category"] == night_categories[4]]
sf_night_assault = sf_night_crimes[sf_night_crimes["Category"] == night_categories[5]]
sf_night_crimes_points = makePoints(sf_night_crimes)
sf_night_vehicle_theft_points = makePoints(sf_night_vehicle_theft)
sf_night_robbery_points = makePoints(sf_night_robbery)
sf_night_prostitution_points = makePoints(sf_night_prostitution)
sf_night_theft_points = makePoints(sf_night_theft)
sf_night_drunk_points = makePoints(sf_night_drunk)
sf_night_assault_points = makePoints(sf_night_assault)
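The six explicit filters above keep the code easy to follow; an equivalent, more compact formulation (just a sketch, not used here) would build the per-category point lists in one pass:
# Sketch: build the per-category point lists with a dict comprehension
# night_points = {cat: makePoints(sf_night_crimes[sf_night_crimes["Category"] == cat])
#                 for cat in night_categories}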
Let's recap where we are at this point:
- A Basemap instance created, with our neighborhood shapefile overlaid.
- shapely Points created for each of our night crime categories.

All that is left is to scatter the individual crime Points on the map and render them on top of the Basemap to get a geospatial visualization of our crime occurrences at night.
Some considerations used here to produce a visually pleasing visualization:

- PROSTITUTION, DRUNKENNESS and ROBBERY, which are less frequent, have a bigger scatter plot size so they don't disappear under the far more frequent ASSAULT, LARCENY/THEFT and VEHICLE THEFT.
- LARCENY/THEFT has quite a bit of overlap with the other categories, so it is represented with a cross (x) instead of a dot (o) like the others, in order to be able to distinguish every category.

plt.clf()
fig = plt.figure()
ax = fig.add_subplot(111, axisbg='w', frame_on=False)
# Scatter vehicle theft occurrences
dev = m.scatter(
[geom.x for geom in sf_night_vehicle_theft_points],
[geom.y for geom in sf_night_vehicle_theft_points],
10, marker='o', lw=.25,
facecolor='cyan', edgecolor='cyan',
alpha=0.75, antialiased=True,
label='Vehicle Theft', zorder=3)
# Scatter robbery occurrences
dev = m.scatter(
[geom.x for geom in sf_night_robbery_points],
[geom.y for geom in sf_night_robbery_points],
10, marker='o', lw=5.5,
facecolor='red', edgecolor='red',
alpha=0.75, antialiased=True,
label='Robbery', zorder=3)
# Scatter prostitution occurrences
dev = m.scatter(
[geom.x for geom in sf_night_prostitution_points],
[geom.y for geom in sf_night_prostitution_points],
10, marker='o', lw=5.5,
facecolor='green', edgecolor='green',
alpha=0.75, antialiased=True,
label='Prostitution', zorder=3)
# Scatter larceny/theft occurrences
dev = m.scatter(
[geom.x for geom in sf_night_theft_points],
[geom.y for geom in sf_night_theft_points],
10, marker='x', lw=.75,
facecolor='magenta', edgecolor='magenta',
alpha=0.75, antialiased=True,
label='Larceny/Theft', zorder=3)
# Scatter drunkenness occurrences
dev = m.scatter(
[geom.x for geom in sf_night_drunk_points],
[geom.y for geom in sf_night_drunk_points],
10, marker='o', lw=5.5,
facecolor='yellow', edgecolor='yellow',
alpha=0.75, antialiased=True,
label='Drunkenness', zorder=3)
# Scatter assault occurrences
dev = m.scatter(
[geom.x for geom in sf_night_assault_points],
[geom.y for geom in sf_night_assault_points],
10, marker='o', lw=.25,
facecolor='orange', edgecolor='orange',
alpha=0.75, antialiased=True,
label='Assault', zorder=3)
ax.add_collection(PatchCollection(df_map['patches'].values, match_original=True))
m.drawmapscale(
coords[0] + 0.015, coords[1] - 0.005,
coords[0], coords[1],
2.,
barstyle='fancy', labelstyle='simple',
fillcolor1='w', fillcolor2='#555555',
fontcolor='#555555',
zorder=5)
plt.tight_layout()
plt.legend(loc="center right")
plt.title("Geographic distribution of most prevalent crimes at night in San Francisco during Summer 2014", size=15)
fig.set_size_inches(15,15)
plt.show()
So what does this map tell us? We can make plenty of observations now that we have a geospatial representation of crimes and their neighborhoods:
- ASSAULT crimes seem to be particularly prevalent in the Tenderloin neighborhood. This makes sense, since the Tenderloin has a reputation for being a pretty sketchy neighborhood.
- PROSTITUTION is not found so much in the city center, but seems to be more concentrated around the Mission district. This also makes sense, because Capp Street has a reputation of having many prostitutes, so the data correlates with this story.
- As for ROBBERY occurrences, they are pretty sparsely distributed across the city and there is no real epicenter. But what we can clearly see is that the further West we go, the fewer robberies there are.
- VEHICLE THEFT and LARCENY/THEFT happen pretty much uniformly everywhere in the city. The frequency is of course higher in the city center, but this is more due to it being more densely populated than anything else.
- As for DRUNKENNESS, you can find more drunk people along the main streets like Market Street. This makes sense because this is probably where most bars are located.

This was a useful visualization, but we're not done yet. One of our goals was to see which neighborhoods to avoid at night. And, while we can start forming a pretty good idea at this point, it would be nice to cluster the different neighborhoods based on their crime density, and represent that on the map.
To that end we will use a classification algorithm called Jenks Natural Breaks Optimization, which will take all our neighborhoods and their crime densities, and divide them into several crime-density classes.
The first step is to take our night crime data (for all 6 night crime categories), and identify to which neighborhood they belong, count those, and derive the crime density from it:
df_map['count'] = df_map['poly'].map(lambda x: int(len(filter(prep(x).contains, sf_night_crimes_points))))
df_map['density_m'] = df_map['count'] / df_map['area_m']
df_map['density_km'] = df_map['count'] / df_map['area_km']
# it's easier to work with NaN values when classifying
df_map.replace(to_replace={'density_m': {0: np.nan}, 'density_km': {0: np.nan}}, inplace=True)
df_map.sample(n=5)
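As an optional sanity check (illustrative), the per-neighborhood counts cannot exceed the total number of night-crime points, since points falling outside every neighborhood polygon are simply dropped:
# Illustrative check: neighborhood counts are bounded by the number of plotted points
assert df_map['count'].sum() <= len(sf_night_crimes_points)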
Next we use PySAL, which has an implementation of the Natural Breaks algorithm. We apply it to the crime density per square kilometer and ask the algorithm to divide the neighborhoods into 5 classes.
from pysal.esda.mapclassify import Natural_Breaks as nb
# Calculate Jenks natural breaks for density
breaks = nb(
df_map[df_map['density_km'].notnull()].density_km.values,
initial=300,
k=5)
prev = 0
for density, counts in zip(breaks.bins, breaks.counts):
print "%d neighborhoods have crime density D such as %d/km^2 <= D <= %d/km^2" % (counts, prev, density)
prev = density
From that, we join back the results of the Natural Breaks algorithm with our map DataFrame, so that we know for each neighborhood what its class is. This is all we need in terms of data; based on that we will be able to assign a different color to each neighborhood according to its class.
We also want to create the labels for the color scale based on the density values found by the algorithm, so we can show in the map to what density each color corresponds.
# the notnull method lets us match indices when joining
jb = pd.DataFrame({'jenks_bins': breaks.yb}, index=df_map[df_map['density_km'].notnull()].index)
df_map.drop("jenks_bins", inplace=True, axis=1, errors="ignore")
df_map = df_map.join(jb)
df_map.jenks_bins.fillna(-1, inplace=True)
# assign labels to the colorbar based on the natural breaks bins
jenks_labels = ["<= %0.1f crimes / km$^2$" % b for b in breaks.bins]
jenks_labels.insert(0, 'No crimes')
Finally all that's left to do is to draw the map and assign the right colors! We will use the Blues colormap, which is more aesthetically pleasing.
from matplotlib.colors import Normalize, LinearSegmentedColormap
plt.clf()
fig = plt.figure()
ax = fig.add_subplot(111, axisbg='w', frame_on=False)
# use a blue colour ramp - we'll be converting it to a map using cmap()
cmap = plt.get_cmap('Blues')
pc = PatchCollection(df_map['patches'], match_original=True)
# impose our colour map onto the patch collection
norm = Normalize()
pc.set_facecolor(cmap(norm(df_map['jenks_bins'].values)))
ax.add_collection(pc)
# Convenience functions for working with colour ramps and bars
def colorbar_index(ncolors, cmap, labels=None, **kwargs):
cmap = cmap_discretize(cmap, ncolors)
mappable = cm.ScalarMappable(cmap=cmap)
mappable.set_array([])
mappable.set_clim(-0.5, ncolors+0.5)
colorbar = plt.colorbar(mappable, **kwargs)
colorbar.set_ticks(np.linspace(0, ncolors, ncolors))
colorbar.set_ticklabels(range(ncolors))
if labels:
colorbar.set_ticklabels(labels)
return colorbar
def cmap_discretize(cmap, N):
    if type(cmap) == str:
        cmap = plt.get_cmap(cmap)
    colors_i = np.concatenate((np.linspace(0, 1., N), (0., 0., 0., 0.)))
    colors_rgba = cmap(colors_i)
    indices = np.linspace(0, 1., N + 1)
    cdict = {}
    for ki, key in enumerate(('red', 'green', 'blue')):
        cdict[key] = [(indices[i], colors_rgba[i - 1, ki], colors_rgba[i, ki]) for i in xrange(N + 1)]
    return LinearSegmentedColormap(cmap.name + "_%d" % N, cdict, 1024)
# Add a colour bar
cb = colorbar_index(ncolors=len(jenks_labels), cmap=cmap, shrink=0.5, labels=jenks_labels)
cb.ax.tick_params(labelsize=10)
# Show the neighborhoods with the highest crime densities, in descending order
highest = '\n'.join(
    row['ward_name'] for _, row in
    df_map[df_map['jenks_bins'] == 4].sort('density_km', ascending=False)[:10].iterrows())
highest = 'Most criminal neighborhoods:\n\n' + highest
# Subtraction is necessary for precise y coordinate alignment
details = cb.ax.text(
-1., 0 + 0.3,
highest,
ha='right', va='bottom',
size=10,
color='#555555')
# Draw a map scale
m.drawmapscale(
coords[0] + 0.015, coords[1] - 0.005,
coords[0], coords[1],
2.,
barstyle='fancy', labelstyle='simple',
fillcolor1='w', fillcolor2='#555555',
fontcolor='#555555',
zorder=5)
plt.tight_layout()
plt.title("Crime density at night by San Francisco neighborhood during Summer 2014", size=15)
fig.set_size_inches(15, 15)
plt.show()
This map is much easier to parse visually than the previous map. It contains less information, but information is aggregated and tells a clearer story with no need for interpretation.
Here we can see that most of the highest-crime neighborhoods are right in the city center. This can be explained by the fact that this area is more densely populated, and is also where most nighttime activity happens, so this was expected.
What is interesting to notice, though, is that even though they are both far from the city center, Western San Francisco (the Avenues and Twin Peaks) and the Southeast districts have very different crime patterns: there are far more crimes in the Southeast than in the West.
At this point we have a pretty good understanding of crime activity at night in San Francisco, both temporally and spatially. What would now be interesting to see is whether there is any correlation between what we've found in San Francisco and crime data in Seattle.
We already did the groundwork during pre-processing to normalize Seattle and San Francisco's datasets, so it is now a simple matter of visualizing what we previously did.
An interesting way to show the differences in crime distribution between the 2 cities is to use hierarchical clustering on normalized crime occurrences broken down by category. Thankfully, Python's seaborn library provides powerful visualization techniques and, among those, a clustermap, which essentially uses hierarchical clustering to discover relationships between rows and columns in a DataFrame.
The first step is to prepare our data: we want to create a DataFrame where each row represents a category and each column a different city. A few extra processing steps:

- Only keep crimes happening at night, for consistency with the rest of this analysis.
- Filter out crime categories which don't have occurrences in both cities.
- Normalize each row, so that very frequent categories like LARCENY/THEFT don't end up alone in a cluster with everything else in a separate cluster. The sum of each row should be 1.

# only keep night crimes
gb = dataset[(dataset["TimeBin"] >= 18) | (dataset["TimeBin"] <= 6)].groupby("Category")
category_by_city = gb["City"].value_counts().unstack().fillna(0)
# filter out crimes which don't have an occurrence in both cities
category_by_city_nonnull = category_by_city[(category_by_city[SF] > 0) & (category_by_city[SEATTLE] > 0)]
print "Crime categories not in both cities:"
print category_by_city[(category_by_city[SF] == 0) | (category_by_city[SEATTLE] == 0)]
# normalize the distribution of crimes so that each rows adds up to 1
category_by_city_norm = category_by_city_nonnull.div(category_by_city_nonnull.sum(axis=1), axis=0)
category_by_city_norm.sample(n=5)
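As an optional check (illustrative), every row of the normalized DataFrame should now sum to 1:
# Illustrative check: each category row sums to 1 after normalization
assert np.allclose(category_by_city_norm.sum(axis=1), 1.0)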
We are now ready to plot our clustermap, still using the Blues colormap for consistency with what we did previously:
import seaborn as sns
g = sns.clustermap(category_by_city_norm, annot=True, cmap="Blues", fmt="g")
plt.setp(g.ax_heatmap.yaxis.get_majorticklabels(), rotation=0)
g.fig.suptitle("Normalized distribution and clustering of crimes between San Francisco and Seattle", size=15)
plt.show()
The clustermap has rearranged the order of the categories to exhibit the different clusters. We can see quite clearly that 3 different clusters emerge:
Let's look at our night categories again:
- ASSAULT: 0.54 (SF) ~= 0.46 (Seattle)
- DRUNKENNESS: not present in Seattle, probably due to the way we manually mapped the Seattle categories, or it could already be labelled by SeattlePD under different categories.
- LARCENY/THEFT: 0.43 (SF) ~= 0.57 (Seattle)
- PROSTITUTION: 0.31 (SF) < 0.69 (Seattle)
- ROBBERY: 0.28 (SF) < 0.72 (Seattle)
- VEHICLE THEFT: 0.38 (SF) < 0.62 (Seattle)

ASSAULT and LARCENY/THEFT are by far the most frequent crime categories (see the previous map to convince yourself), so it makes sense that their proportions are roughly balanced between the 2 cities: the large number of occurrences makes the comparison statistically meaningful.
For the other categories, however, the proportions are far lower in San Francisco than in Seattle. There are multiple ways to explain that:
In this analysis we discovered the following facts: