Spatial data in pytidycensus
This notebook demonstrates how to work with spatial data in pytidycensus, including creating maps and performing spatial analysis with GeoPandas.
Setup
First, let’s install and import the necessary packages:
# Uncomment to install if running in Colab
# !pip install pytidycensus geopandas matplotlib contextily
import pytidycensus as tc
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LinearSegmentedColormap
import warnings
warnings.filterwarnings('ignore')
# Set up matplotlib
plt.style.use('default')
%matplotlib inline
Census API Key Setup
Set your Census API key (get one at https://api.census.gov/data/key_signup.html):
# Replace with your actual API key
# tc.set_census_api_key("Your API Key Here")
Ignore the next cell. I am just loading my credentials from a yaml file in the parent directory.
import os
# Try to get API key from environment
api_key = os.environ.get("CENSUS_API_KEY")
# For documentation builds without a key, we'll mock the responses
try:
tc.set_census_api_key(api_key)
print("Using Census API key from environment")
except Exception:
print("Using example API key for documentation")
# This won't make real API calls during documentation builds
tc.set_census_api_key("EXAMPLE_API_KEY_FOR_DOCS")
Census API key has been set for this session.
Using Census API key from environment
Getting Spatial Data
When you set geometry=True
in pytidycensus functions, the package automatically downloads the corresponding TIGER/Line shapefiles and merges them with your data.
Example: Median Household Income in Orange County, CA
Let’s get median household income for Census tracts in Orange County, California:
# Get ACS data with geometry for Orange County, CA
orange = tc.get_acs(
state="CA",
county="Orange",
geography="tract",
variables="B19013_001", # Median household income
geometry=True,
year=2022
)
print(f"Data type: {type(orange)}")
print(f"Shape: {orange.shape}")
print(f"CRS: {orange.crs}")
orange.head()
Getting data from the 2018-2022 5-year ACS
Loading tract boundaries...
Data type: <class 'geopandas.geodataframe.GeoDataFrame'>
Shape: (614, 8)
CRS: EPSG:4269
GEOID | geometry | B19013_001E | state | county | tract | NAME | B19013_001_moe | |
---|---|---|---|---|---|---|---|---|
0 | 06059990100 | POLYGON ((-118.14153 33.70362, -118.12858 33.7... | <NA> | 06 | 059 | 990100 | Orange County, California | <NA> |
1 | 06059062649 | POLYGON ((-117.7616 33.58442, -117.76157 33.58... | 75028 | 06 | 059 | 062649 | Orange County, California | 18255.0 |
2 | 06059032058 | POLYGON ((-117.65902 33.55605, -117.65666 33.5... | 232719 | 06 | 059 | 032058 | Orange County, California | 29201.0 |
3 | 06059032061 | POLYGON ((-117.66979 33.52778, -117.66967 33.5... | 154940 | 06 | 059 | 032061 | Orange County, California | 26818.0 |
4 | 06059021813 | POLYGON ((-117.83796 33.85113, -117.83795 33.8... | <NA> | 06 | 059 | 021813 | Orange County, California | <NA> |
The returned object is a GeoDataFrame with a geometry
column containing the spatial boundaries. The default coordinate system is NAD83 (EPSG:4269).
Basic Choropleth Mapping
Let’s create a basic choropleth map:
# Create a basic choropleth map with missing data highlighted
fig, ax = plt.subplots(figsize=(12, 10))
# Create a mask for missing values
missing_mask = orange["B19013_001E"].isna()
# Plot non-missing data with color scheme
orange[~missing_mask].plot(
column="B19013_001E",
cmap="viridis",
linewidth=0.1,
edgecolor="white",
legend=True,
ax=ax,
scheme="quantiles", # or "equal_interval"
k=5,
)
# Plot missing data with distinct color
orange[missing_mask].plot(
color="lightgray", # or any color for missing data
linewidth=0.1,
edgecolor="white",
ax=ax,
)
ax.set_title(
"Median Household Income by Census Tract\nOrange County, CA (2018-2022 ACS)",
fontsize=14,
fontweight="bold",
)
ax.set_axis_off()
# Add missing data to legend
from matplotlib.patches import Patch
handles, labels = ax.get_legend_handles_labels()
handles.append(Patch(facecolor="lightgray", label="Missing Data"))
ax.legend(handles=handles, loc="upper right")
plt.tight_layout()
plt.show()

Coordinate Reference Systems
For better visualization and analysis, it’s often useful to project the data to an appropriate coordinate system:
# Project to California Albers (EPSG:3310) for better area representation
orange_projected = orange.to_crs('EPSG:3310')
fig, ax = plt.subplots(figsize=(12, 10))
orange_projected.plot(
column="B19013_001E",
cmap="magma",
linewidth=0.1,
edgecolor="white",
legend=False,
ax=ax,
)
ax.set_title('Median Household Income (Projected)\nOrange County, CA',
fontsize=14, fontweight='bold')
ax.set_axis_off()
plt.tight_layout()
plt.show()

Faceted Mapping
One powerful feature is creating small multiples to compare different variables. Let’s look at racial demographics in Harris County, Texas.
Race/Ethnicity Data with Summary Variable
# Define race/ethnicity variables for 2020 Census
race_vars = {
"White": "P2_005N",
"Black": "P2_006N",
"Asian": "P2_008N",
"Hispanic": "P2_002N"
}
# Get data for Harris County, TX with total population as summary variable
harris = tc.get_decennial(
geography="tract",
variables=race_vars,
state="TX",
county="Harris",
geometry=True,
summary_var="P2_001N", # Total population
year=2020,
sumfile="pl"
)
print(f"Shape: {harris.shape}")
harris.head()
Getting data from the 2020 decennial Census
Using the PL 94-171 Redistricting Data Summary File
Loading tract boundaries...
Shape: (4460, 13)
GEOID | NAME_x | geometry | STATEFP | COUNTYFP | TRACTCE | state | county | tract | NAME_y | variable | estimate | summary_est | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 48201233105 | 2331.05 | POLYGON ((-95.16994 29.79037, -95.16731 29.790... | 48 | 201 | 233105 | 48 | 201 | 233105 | Harris County, Texas | White | 373 | 3734 |
1 | 48201233105 | 2331.05 | POLYGON ((-95.16994 29.79037, -95.16731 29.790... | 48 | 201 | 233105 | 48 | 201 | 233105 | Harris County, Texas | Black | 62 | 3734 |
2 | 48201233105 | 2331.05 | POLYGON ((-95.16994 29.79037, -95.16731 29.790... | 48 | 201 | 233105 | 48 | 201 | 233105 | Harris County, Texas | Asian | 2 | 3734 |
3 | 48201233105 | 2331.05 | POLYGON ((-95.16994 29.79037, -95.16731 29.790... | 48 | 201 | 233105 | 48 | 201 | 233105 | Harris County, Texas | Hispanic | 3268 | 3734 |
4 | 48201252201 | 2522.01 | POLYGON ((-95.1753 29.83223, -95.17486 29.8324... | 48 | 201 | 252201 | 48 | 201 | 252201 | Harris County, Texas | White | 376 | 4838 |
Notice that we have multiple rows per tract (one for each race/ethnicity variable) and a summary_value
column with the total population.
Calculate Percentages and Create Faceted Map
# Calculate percentage of total population
harris["percent"] = 100 * (harris["estimate"] / harris["summary_est"])
harris.head()
GEOID | NAME_x | geometry | STATEFP | COUNTYFP | TRACTCE | state | county | tract | NAME_y | variable | estimate | summary_est | percent | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 48201233105 | 2331.05 | POLYGON ((-95.16994 29.79037, -95.16731 29.790... | 48 | 201 | 233105 | 48 | 201 | 233105 | Harris County, Texas | White | 373 | 3734 | 9.989288 |
1 | 48201233105 | 2331.05 | POLYGON ((-95.16994 29.79037, -95.16731 29.790... | 48 | 201 | 233105 | 48 | 201 | 233105 | Harris County, Texas | Black | 62 | 3734 | 1.660418 |
2 | 48201233105 | 2331.05 | POLYGON ((-95.16994 29.79037, -95.16731 29.790... | 48 | 201 | 233105 | 48 | 201 | 233105 | Harris County, Texas | Asian | 2 | 3734 | 0.053562 |
3 | 48201233105 | 2331.05 | POLYGON ((-95.16994 29.79037, -95.16731 29.790... | 48 | 201 | 233105 | 48 | 201 | 233105 | Harris County, Texas | Hispanic | 3268 | 3734 | 87.520086 |
4 | 48201252201 | 2522.01 | POLYGON ((-95.1753 29.83223, -95.17486 29.8324... | 48 | 201 | 252201 | 48 | 201 | 252201 | Harris County, Texas | White | 376 | 4838 | 7.771807 |
# Create faceted map
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.ravel()
# Get unique variables for iteration
variables = harris['variable'].unique()
for i, var in enumerate(variables):
# Filter data for this variable
subset = harris[harris['variable'] == var]
# Create map
subset.plot(
column='percent',
cmap='viridis',
linewidth=0,
legend=True,
ax=axes[i],
vmin=0,
vmax=80 # Set consistent scale
)
axes[i].set_title(f'{var}', fontsize=12, fontweight='bold')
axes[i].set_axis_off()
plt.suptitle('Racial and Ethnic Geography of Harris County, TX\n2020 Census (% of Total Population)',
fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

Interactive Analysis
Let’s explore the data interactively by looking at summary statistics:
# Summary statistics by race/ethnicity
summary_stats = harris.groupby('variable')['percent'].describe()
print("Percentage Distribution by Race/Ethnicity:")
print(summary_stats.round(2))
Percentage Distribution by Race/Ethnicity:
count mean std min 25% 50% 75% max
variable
Asian 1113.0 7.27 8.11 0.0 1.29 4.82 10.37 59.71
Black 1113.0 19.19 18.12 0.0 5.36 13.35 27.14 91.57
Hispanic 1113.0 43.02 23.88 0.0 22.84 37.65 60.16 96.52
White 1113.0 27.14 22.76 0.0 6.67 19.54 46.49 82.94
# Create box plots to show distribution
fig, ax = plt.subplots(figsize=(10, 6))
harris.boxplot(column='percent', by='variable', ax=ax)
ax.set_title('Distribution of Race/Ethnicity Percentages by Census Tract\nHarris County, TX')
ax.set_xlabel('Race/Ethnicity Group')
ax.set_ylabel('Percentage of Total Population')
plt.xticks(rotation=45)
plt.suptitle('') # Remove automatic title
plt.tight_layout()
plt.show()

Working with Different Geographic Levels
State-Level Data
Let’s create a map of median household income for all US states:
# Get median household income for all states
states_income = tc.get_acs(
geography="state",
variables="B19013_001",
year=2022,
geometry=True
)
states_income.head()
Getting data from the 2018-2022 5-year ACS
Loading state boundaries...
GEOID | geometry | STUSPS | B19013_001E | state | NAME | B19013_001_moe | |
---|---|---|---|---|---|---|---|
0 | 54 | POLYGON ((-77.75438 39.33346, -77.75422 39.333... | WV | 55217 | 54 | West Virginia | 577.0 |
1 | 12 | MULTIPOLYGON (((-83.10874 24.62949, -83.10711 ... | FL | 67917 | 12 | Florida | 259.0 |
2 | 17 | POLYGON ((-87.89243 38.28285, -87.89334 38.282... | IL | 78433 | 17 | Illinois | 297.0 |
3 | 27 | POLYGON ((-95.31989 48.99892, -95.31747 48.998... | MN | 84313 | 27 | Minnesota | 348.0 |
4 | 24 | POLYGON ((-75.756 39.24607, -75.75578 39.24334... | MD | 98461 | 24 | Maryland | 511.0 |
# Remove territories for cleaner continental US map
states_continental = states_income[
~states_income["NAME"].isin(
[
"Puerto Rico",
"United States Virgin Islands",
"Guam",
"American Samoa",
"Commonwealth of the Northern Mariana Islands",
]
)
]
print(f"Number of states/DC: {len(states_continental)}")
Number of states/DC: 51
# Project to Albers Equal Area for US mapping
states_albers = states_continental.to_crs('EPSG:5070')
fig, ax = plt.subplots(figsize=(15, 10))
states_albers.plot(
column="B19013_001E",
cmap="RdYlBu_r",
linewidth=0.5,
edgecolor="white",
legend=True,
ax=ax,
)
ax.set_title('Median Household Income by State\n2018-2022 American Community Survey',
fontsize=16, fontweight='bold')
ax.set_axis_off()
# Add colorbar label
cbar = ax.get_figure().get_axes()[1]
cbar.set_ylabel('Median Household Income ($)', rotation=270, labelpad=20)
plt.tight_layout()
plt.show()

County-Level Analysis
Let’s look at population density across counties in a specific state:
# Get population and calculate density for California counties
ca_counties = tc.get_acs(
geography="county",
variables="B01003_001", # Total population
state="CA",
year=2022,
geometry=True
)
ca_counties.head()
Getting data from the 2018-2022 5-year ACS
Loading county boundaries...
GEOID | geometry | NAMELSAD | B01003_001E | state | county | NAME | B01003_001_moe | |
---|---|---|---|---|---|---|---|---|
0 | 06091 | POLYGON ((-120.55587 39.50874, -120.55614 39.5... | Sierra County | 2916 | 06 | 091 | Sierra County, California | 206.0 |
1 | 06067 | POLYGON ((-121.43991 38.25553, -121.44002 38.2... | Sacramento County | 1579211 | 06 | 067 | Sacramento County, California | <NA> |
2 | 06083 | MULTIPOLYGON (((-120.58226 34.10752, -120.5790... | Santa Barbara County | 445213 | 06 | 083 | Santa Barbara County, California | <NA> |
3 | 06009 | POLYGON ((-120.6318 38.34603, -120.63066 38.34... | Calaveras County | 45674 | 06 | 009 | Calaveras County, California | <NA> |
4 | 06111 | MULTIPOLYGON (((-119.63607 33.28071, -119.6348... | Ventura County | 842009 | 06 | 111 | Ventura County, California | <NA> |
# Calculate area in square kilometers and population density
ca_counties_proj = ca_counties.to_crs('EPSG:3310') # California Albers
ca_counties_proj['area_km2'] = ca_counties_proj.geometry.area / 1e6
ca_counties_proj["density"] = (
ca_counties_proj["B01003_001E"] / ca_counties_proj["area_km2"]
)
print("Top 10 most dense counties:")
print(ca_counties_proj.nlargest(10, 'density')[['NAME', 'density']].round(1))
Top 10 most dense counties:
NAME density
10 San Francisco County, California 1416.8
47 Orange County, California 1290.9
5 Los Angeles County, California 807.8
41 Alameda County, California 782.3
1 Sacramento County, California 613.2
38 Santa Clara County, California 567.5
24 Contra Costa County, California 558.4
53 San Mateo County, California 393.0
8 San Diego County, California 280.6
40 San Joaquin County, California 211.0
# Map population density with log scale
fig, ax = plt.subplots(figsize=(12, 15))
# Use log scale for better visualization of density
ca_counties_proj['log_density'] = np.log10(ca_counties_proj['density'] + 1)
ca_counties_proj.plot(
column='log_density',
cmap='OrRd',
linewidth=0.2,
edgecolor='white',
legend=True,
ax=ax
)
ax.set_title('Population Density by County\nCalifornia (Log Scale)',
fontsize=14, fontweight='bold')
ax.set_axis_off()
plt.tight_layout()
plt.show()

Advanced Visualization Techniques
Using Different Color Schemes
# Compare different color schemes
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.ravel()
cmaps = ['viridis', 'plasma', 'RdYlBu_r', 'coolwarm']
titles = ['Viridis', 'Plasma', 'RdYlBu_r', 'Coolwarm']
for i, (cmap, title) in enumerate(zip(cmaps, titles)):
ca_counties_proj.plot(
column="log_density",
cmap=cmap,
linewidth=0.1,
edgecolor="white",
legend=True,
ax=axes[i],
)
axes[i].set_title(f'{title} Color Scheme', fontweight='bold')
axes[i].set_axis_off()
plt.suptitle('Median Household Income - Different Color Schemes\nOrange County, CA',
fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

Classification Schemes
Sometimes it’s useful to classify continuous data into discrete categories:
# Create income categories
orange_classified = orange_projected.copy()
orange_classified.dropna(subset=["B19013_001E"], inplace=True)
# Define income brackets
bins = [0, 50000, 75000, 100000, 150000, float('inf')]
labels = ['< $50K', '$50K-$75K', '$75K-$100K', '$100K-$150K', '> $150K']
orange_classified["income_category"] = pd.cut(
orange_classified["B19013_001E"], bins=bins, labels=labels, include_lowest=True
)
# Create custom color palette
colors = ['#fee5d9', '#fcbba1', '#fc9272', '#fb6a4a', '#de2d26']
from matplotlib.patches import Patch
# Create legend handles for each category
legend_handles = [
Patch(facecolor=color, edgecolor="white", label=label)
for label, color in zip(labels, colors)
]
fig, ax = plt.subplots(figsize=(12, 10))
for category, color in zip(labels, colors):
subset = orange_classified[orange_classified["income_category"] == category]
subset.plot(color=color, linewidth=0.1, edgecolor="white", ax=ax)
ax.set_title(
"Median Household Income Categories\nOrange County, CA",
fontsize=14,
fontweight="bold",
)
ax.set_axis_off()
ax.legend(
handles=legend_handles,
title="Income Category",
loc="upper left",
bbox_to_anchor=(1, 1),
)
plt.tight_layout()
plt.show()

Spatial Analysis
Let’s perform some basic spatial analysis:
# Calculate centroids and identify high-income clusters
orange_analysis = orange_projected.copy()
orange_analysis['centroid'] = orange_analysis.geometry.centroid
# Identify high-income tracts (top quartile)
high_income_threshold = orange_analysis["B19013_001E"].quantile(0.75)
orange_analysis["high_income"] = orange_analysis["B19013_001E"] >= high_income_threshold
print(f"High income threshold: ${high_income_threshold:,.0f}")
print(f"Number of high-income tracts: {orange_analysis['high_income'].sum()}")
print(f"Percentage of tracts: {100 * orange_analysis['high_income'].mean():.1f}%")
High income threshold: $140,567
Number of high-income tracts: 153
Percentage of tracts: 24.9%
# Visualize high-income areas
fig, ax = plt.subplots(figsize=(12, 10))
# Plot all tracts in light gray
orange_analysis.plot(
color='lightgray',
linewidth=0.1,
edgecolor='white',
ax=ax
)
# Highlight high-income tracts
high_income_tracts = orange_analysis[orange_analysis['high_income']]
high_income_tracts.plot(
color='darkred',
linewidth=0.1,
edgecolor='white',
ax=ax,
alpha=0.8
)
ax.set_title(f'High-Income Census Tracts\nOrange County, CA (Top 25%, >${high_income_threshold:,.0f})',
fontsize=14, fontweight='bold')
ax.set_axis_off()
# Add legend
from matplotlib.patches import Patch
legend_elements = [
Patch(facecolor='lightgray', label='Other tracts'),
Patch(facecolor='darkred', label=f'High income (>${high_income_threshold:,.0f}+)')
]
ax.legend(handles=legend_elements, loc='upper right')
plt.tight_layout()
plt.show()

Summary Statistics by Geography
Let’s compute some summary statistics for our spatial data:
# Summary statistics for Orange County income data
print("Median Household Income Statistics - Orange County, CA")
print("=" * 55)
print(f"Mean: ${orange['B19013_001E'].mean():,.0f}")
print(f"Median: ${orange['B19013_001E'].median():,.0f}")
print(f"Std Dev: ${orange['B19013_001E'].std():,.0f}")
print(f"Min: ${orange['B19013_001E'].min():,.0f}")
print(f"Max: ${orange['B19013_001E'].max():,.0f}")
print(f"Range: ${orange['B19013_001E'].max() - orange['B19013_001E'].min():,.0f}")
# Quartile analysis
print("\nQuartile Analysis:")
quartiles = orange["B19013_001E"].quantile([0.25, 0.5, 0.75])
for q, val in quartiles.items():
print(f"{int(q*100)}th percentile: ${val:,.0f}")
Median Household Income Statistics - Orange County, CA
=======================================================
Mean: $117,036
Median: $112,780
Std Dev: $40,207
Min: $36,441
Max: $250,001
Range: $213,560
Quartile Analysis:
25th percentile: $85,860
50th percentile: $112,780
75th percentile: $140,567
Saving Spatial Data
You can save your spatial data in various formats:
# Save to different formats (uncomment to save)
# orange.to_file("orange_county_income.shp") # Shapefile
# orange.to_file("orange_county_income.geojson", driver="GeoJSON") # GeoJSON
# orange.to_parquet("orange_county_income.parquet") # Parquet (preserves geometry)
print("File formats available for saving:")
print("- Shapefile (.shp)")
print("- GeoJSON (.geojson)")
print("- Parquet (.parquet)")
print("- PostGIS database")
print("- And many more via GeoPandas!")
File formats available for saving:
- Shapefile (.shp)
- GeoJSON (.geojson)
- Parquet (.parquet)
- PostGIS database
- And many more via GeoPandas!
Summary
In this notebook, we learned how to:
Get spatial data by setting
geometry=True
in pytidycensus functionsCreate choropleth maps using GeoPandas and matplotlib
Work with coordinate systems and projections
Create faceted maps for comparing multiple variables
Use different visualization techniques including color schemes and classification
Perform basic spatial analysis like identifying spatial patterns
Save spatial data in various formats
Next Steps
Advanced Spatial Analysis: Use spatial statistics packages like
PySAL
Interactive Maps: Create web maps with
folium
orplotly
3D Visualization: Use packages like
pydeck
for 3D mappingWeb Mapping: Deploy maps with
streamlit
ordash