Spatial data in pytidycensus

Open in Colab

This notebook demonstrates how to work with spatial data in pytidycensus, including creating maps and performing spatial analysis with GeoPandas.

Setup

First, let’s install and import the necessary packages:

# Uncomment to install if running in Colab
# !pip install pytidycensus geopandas matplotlib contextily

import pytidycensus as tc
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import LinearSegmentedColormap
import warnings
warnings.filterwarnings('ignore')

# Set up matplotlib
plt.style.use('default')
%matplotlib inline

Census API Key Setup

Set your Census API key (get one at https://api.census.gov/data/key_signup.html):

# Replace with your actual API key
# tc.set_census_api_key("Your API Key Here")

Ignore the next cell. I am just loading my credentials from a yaml file in the parent directory.

import os

# Try to get API key from environment
api_key = os.environ.get("CENSUS_API_KEY")

# For documentation builds without a key, we'll mock the responses
try:
    tc.set_census_api_key(api_key)
    print("Using Census API key from environment")
except Exception:
    print("Using example API key for documentation")
    # This won't make real API calls during documentation builds
    tc.set_census_api_key("EXAMPLE_API_KEY_FOR_DOCS")
Census API key has been set for this session.
Using Census API key from environment

Getting Spatial Data

When you set geometry=True in pytidycensus functions, the package automatically downloads the corresponding TIGER/Line shapefiles and merges them with your data.

Example: Median Household Income in Orange County, CA

Let’s get median household income for Census tracts in Orange County, California:

# Get ACS data with geometry for Orange County, CA
orange = tc.get_acs(
    state="CA",
    county="Orange",
    geography="tract",
    variables="B19013_001",  # Median household income
    geometry=True,
    year=2022
)

print(f"Data type: {type(orange)}")
print(f"Shape: {orange.shape}")
print(f"CRS: {orange.crs}")
orange.head()
Getting data from the 2018-2022 5-year ACS
Loading tract boundaries...
Data type: <class 'geopandas.geodataframe.GeoDataFrame'>
Shape: (614, 8)
CRS: EPSG:4269
GEOID geometry B19013_001E state county tract NAME B19013_001_moe
0 06059990100 POLYGON ((-118.14153 33.70362, -118.12858 33.7... <NA> 06 059 990100 Orange County, California <NA>
1 06059062649 POLYGON ((-117.7616 33.58442, -117.76157 33.58... 75028 06 059 062649 Orange County, California 18255.0
2 06059032058 POLYGON ((-117.65902 33.55605, -117.65666 33.5... 232719 06 059 032058 Orange County, California 29201.0
3 06059032061 POLYGON ((-117.66979 33.52778, -117.66967 33.5... 154940 06 059 032061 Orange County, California 26818.0
4 06059021813 POLYGON ((-117.83796 33.85113, -117.83795 33.8... <NA> 06 059 021813 Orange County, California <NA>

The returned object is a GeoDataFrame with a geometry column containing the spatial boundaries. The default coordinate system is NAD83 (EPSG:4269).

Basic Choropleth Mapping

Let’s create a basic choropleth map:

# Create a basic choropleth map with missing data highlighted
fig, ax = plt.subplots(figsize=(12, 10))

# Create a mask for missing values
missing_mask = orange["B19013_001E"].isna()

# Plot non-missing data with color scheme
orange[~missing_mask].plot(
    column="B19013_001E",
    cmap="viridis",
    linewidth=0.1,
    edgecolor="white",
    legend=True,
    ax=ax,
    scheme="quantiles",  # or "equal_interval"
    k=5,
)

# Plot missing data with distinct color
orange[missing_mask].plot(
    color="lightgray",  # or any color for missing data
    linewidth=0.1,
    edgecolor="white",
    ax=ax,
)

ax.set_title(
    "Median Household Income by Census Tract\nOrange County, CA (2018-2022 ACS)",
    fontsize=14,
    fontweight="bold",
)
ax.set_axis_off()

# Add missing data to legend
from matplotlib.patches import Patch

handles, labels = ax.get_legend_handles_labels()
handles.append(Patch(facecolor="lightgray", label="Missing Data"))
ax.legend(handles=handles, loc="upper right")

plt.tight_layout()
plt.show()
../_images/f1ccca79f1c331cdc7134f6b80ee2b2b656fa4bf5b4ce104125167b101327059.png

Coordinate Reference Systems

For better visualization and analysis, it’s often useful to project the data to an appropriate coordinate system:

# Project to California Albers (EPSG:3310) for better area representation
orange_projected = orange.to_crs('EPSG:3310')

fig, ax = plt.subplots(figsize=(12, 10))

orange_projected.plot(
    column="B19013_001E",
    cmap="magma",
    linewidth=0.1,
    edgecolor="white",
    legend=False,
    ax=ax,
)

ax.set_title('Median Household Income (Projected)\nOrange County, CA', 
             fontsize=14, fontweight='bold')
ax.set_axis_off()

plt.tight_layout()
plt.show()
../_images/7197dea70b14cbbaa135fd9416c091a81ddff12cc0f4fe1de4201d93f168cbdc.png

Faceted Mapping

One powerful feature is creating small multiples to compare different variables. Let’s look at racial demographics in Harris County, Texas.

Race/Ethnicity Data with Summary Variable

# Define race/ethnicity variables for 2020 Census
race_vars = {
    "White": "P2_005N",
    "Black": "P2_006N", 
    "Asian": "P2_008N",
    "Hispanic": "P2_002N"
}

# Get data for Harris County, TX with total population as summary variable
harris = tc.get_decennial(
    geography="tract",
    variables=race_vars,
    state="TX",
    county="Harris",
    geometry=True,
    summary_var="P2_001N",  # Total population
    year=2020,
    sumfile="pl"
)

print(f"Shape: {harris.shape}")
harris.head()
Getting data from the 2020 decennial Census
Using the PL 94-171 Redistricting Data Summary File
Loading tract boundaries...
Shape: (4460, 13)
GEOID NAME_x geometry STATEFP COUNTYFP TRACTCE state county tract NAME_y variable estimate summary_est
0 48201233105 2331.05 POLYGON ((-95.16994 29.79037, -95.16731 29.790... 48 201 233105 48 201 233105 Harris County, Texas White 373 3734
1 48201233105 2331.05 POLYGON ((-95.16994 29.79037, -95.16731 29.790... 48 201 233105 48 201 233105 Harris County, Texas Black 62 3734
2 48201233105 2331.05 POLYGON ((-95.16994 29.79037, -95.16731 29.790... 48 201 233105 48 201 233105 Harris County, Texas Asian 2 3734
3 48201233105 2331.05 POLYGON ((-95.16994 29.79037, -95.16731 29.790... 48 201 233105 48 201 233105 Harris County, Texas Hispanic 3268 3734
4 48201252201 2522.01 POLYGON ((-95.1753 29.83223, -95.17486 29.8324... 48 201 252201 48 201 252201 Harris County, Texas White 376 4838

Notice that we have multiple rows per tract (one for each race/ethnicity variable) and a summary_value column with the total population.

Calculate Percentages and Create Faceted Map

# Calculate percentage of total population
harris["percent"] = 100 * (harris["estimate"] / harris["summary_est"])
harris.head()
GEOID NAME_x geometry STATEFP COUNTYFP TRACTCE state county tract NAME_y variable estimate summary_est percent
0 48201233105 2331.05 POLYGON ((-95.16994 29.79037, -95.16731 29.790... 48 201 233105 48 201 233105 Harris County, Texas White 373 3734 9.989288
1 48201233105 2331.05 POLYGON ((-95.16994 29.79037, -95.16731 29.790... 48 201 233105 48 201 233105 Harris County, Texas Black 62 3734 1.660418
2 48201233105 2331.05 POLYGON ((-95.16994 29.79037, -95.16731 29.790... 48 201 233105 48 201 233105 Harris County, Texas Asian 2 3734 0.053562
3 48201233105 2331.05 POLYGON ((-95.16994 29.79037, -95.16731 29.790... 48 201 233105 48 201 233105 Harris County, Texas Hispanic 3268 3734 87.520086
4 48201252201 2522.01 POLYGON ((-95.1753 29.83223, -95.17486 29.8324... 48 201 252201 48 201 252201 Harris County, Texas White 376 4838 7.771807
# Create faceted map
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.ravel()

# Get unique variables for iteration
variables = harris['variable'].unique()

for i, var in enumerate(variables):
    # Filter data for this variable
    subset = harris[harris['variable'] == var]
    
    # Create map
    subset.plot(
        column='percent',
        cmap='viridis',
        linewidth=0,
        legend=True,
        ax=axes[i],
        vmin=0,
        vmax=80  # Set consistent scale
    )
    
    axes[i].set_title(f'{var}', fontsize=12, fontweight='bold')
    axes[i].set_axis_off()

plt.suptitle('Racial and Ethnic Geography of Harris County, TX\n2020 Census (% of Total Population)', 
             fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
../_images/66225aaee8e0bfc7edb871484de783ff723849af55d1d3a3f67214924b93b68f.png

Interactive Analysis

Let’s explore the data interactively by looking at summary statistics:

# Summary statistics by race/ethnicity
summary_stats = harris.groupby('variable')['percent'].describe()
print("Percentage Distribution by Race/Ethnicity:")
print(summary_stats.round(2))
Percentage Distribution by Race/Ethnicity:
           count   mean    std  min    25%    50%    75%    max
variable                                                       
Asian     1113.0   7.27   8.11  0.0   1.29   4.82  10.37  59.71
Black     1113.0  19.19  18.12  0.0   5.36  13.35  27.14  91.57
Hispanic  1113.0  43.02  23.88  0.0  22.84  37.65  60.16  96.52
White     1113.0  27.14  22.76  0.0   6.67  19.54  46.49  82.94
# Create box plots to show distribution
fig, ax = plt.subplots(figsize=(10, 6))

harris.boxplot(column='percent', by='variable', ax=ax)
ax.set_title('Distribution of Race/Ethnicity Percentages by Census Tract\nHarris County, TX')
ax.set_xlabel('Race/Ethnicity Group')
ax.set_ylabel('Percentage of Total Population')
plt.xticks(rotation=45)
plt.suptitle('')  # Remove automatic title
plt.tight_layout()
plt.show()
../_images/9538cb340fac5a06991e3a5ac0d8aa41b8f599e3982346ca894465cd34fd1910.png

Working with Different Geographic Levels

State-Level Data

Let’s create a map of median household income for all US states:

# Get median household income for all states
states_income = tc.get_acs(
    geography="state",
    variables="B19013_001",
    year=2022,
    geometry=True
)
states_income.head()
Getting data from the 2018-2022 5-year ACS
Loading state boundaries...
GEOID geometry STUSPS B19013_001E state NAME B19013_001_moe
0 54 POLYGON ((-77.75438 39.33346, -77.75422 39.333... WV 55217 54 West Virginia 577.0
1 12 MULTIPOLYGON (((-83.10874 24.62949, -83.10711 ... FL 67917 12 Florida 259.0
2 17 POLYGON ((-87.89243 38.28285, -87.89334 38.282... IL 78433 17 Illinois 297.0
3 27 POLYGON ((-95.31989 48.99892, -95.31747 48.998... MN 84313 27 Minnesota 348.0
4 24 POLYGON ((-75.756 39.24607, -75.75578 39.24334... MD 98461 24 Maryland 511.0
# Remove territories for cleaner continental US map
states_continental = states_income[
    ~states_income["NAME"].isin(
        [
            "Puerto Rico",
            "United States Virgin Islands",
            "Guam",
            "American Samoa",
            "Commonwealth of the Northern Mariana Islands",
        ]
    )
]

print(f"Number of states/DC: {len(states_continental)}")
Number of states/DC: 51
# Project to Albers Equal Area for US mapping
states_albers = states_continental.to_crs('EPSG:5070')

fig, ax = plt.subplots(figsize=(15, 10))

states_albers.plot(
    column="B19013_001E",
    cmap="RdYlBu_r",
    linewidth=0.5,
    edgecolor="white",
    legend=True,
    ax=ax,
)

ax.set_title('Median Household Income by State\n2018-2022 American Community Survey', 
             fontsize=16, fontweight='bold')
ax.set_axis_off()

# Add colorbar label
cbar = ax.get_figure().get_axes()[1]
cbar.set_ylabel('Median Household Income ($)', rotation=270, labelpad=20)

plt.tight_layout()
plt.show()
../_images/fde772f1bcfa8aec9ae072a7e9bdadbc0b724c14618cb33c84ba0a6450ec9fe4.png

County-Level Analysis

Let’s look at population density across counties in a specific state:

# Get population and calculate density for California counties
ca_counties = tc.get_acs(
    geography="county",
    variables="B01003_001",  # Total population
    state="CA",
    year=2022,
    geometry=True
)
ca_counties.head()
Getting data from the 2018-2022 5-year ACS
Loading county boundaries...
GEOID geometry NAMELSAD B01003_001E state county NAME B01003_001_moe
0 06091 POLYGON ((-120.55587 39.50874, -120.55614 39.5... Sierra County 2916 06 091 Sierra County, California 206.0
1 06067 POLYGON ((-121.43991 38.25553, -121.44002 38.2... Sacramento County 1579211 06 067 Sacramento County, California <NA>
2 06083 MULTIPOLYGON (((-120.58226 34.10752, -120.5790... Santa Barbara County 445213 06 083 Santa Barbara County, California <NA>
3 06009 POLYGON ((-120.6318 38.34603, -120.63066 38.34... Calaveras County 45674 06 009 Calaveras County, California <NA>
4 06111 MULTIPOLYGON (((-119.63607 33.28071, -119.6348... Ventura County 842009 06 111 Ventura County, California <NA>
# Calculate area in square kilometers and population density
ca_counties_proj = ca_counties.to_crs('EPSG:3310')  # California Albers
ca_counties_proj['area_km2'] = ca_counties_proj.geometry.area / 1e6
ca_counties_proj["density"] = (
    ca_counties_proj["B01003_001E"] / ca_counties_proj["area_km2"]
)

print("Top 10 most dense counties:")
print(ca_counties_proj.nlargest(10, 'density')[['NAME', 'density']].round(1))
Top 10 most dense counties:
                                NAME  density
10  San Francisco County, California   1416.8
47         Orange County, California   1290.9
5     Los Angeles County, California    807.8
41        Alameda County, California    782.3
1      Sacramento County, California    613.2
38    Santa Clara County, California    567.5
24   Contra Costa County, California    558.4
53      San Mateo County, California    393.0
8       San Diego County, California    280.6
40    San Joaquin County, California    211.0
# Map population density with log scale
fig, ax = plt.subplots(figsize=(12, 15))

# Use log scale for better visualization of density
ca_counties_proj['log_density'] = np.log10(ca_counties_proj['density'] + 1)

ca_counties_proj.plot(
    column='log_density',
    cmap='OrRd',
    linewidth=0.2,
    edgecolor='white',
    legend=True,
    ax=ax
)

ax.set_title('Population Density by County\nCalifornia (Log Scale)', 
             fontsize=14, fontweight='bold')
ax.set_axis_off()

plt.tight_layout()
plt.show()
../_images/cdaf6af239754ba8fc9660d17cfc5d008269813eac2d4117502cab488f0ff234.png

Advanced Visualization Techniques

Using Different Color Schemes

# Compare different color schemes
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.ravel()

cmaps = ['viridis', 'plasma', 'RdYlBu_r', 'coolwarm']
titles = ['Viridis', 'Plasma', 'RdYlBu_r', 'Coolwarm']

for i, (cmap, title) in enumerate(zip(cmaps, titles)):
    ca_counties_proj.plot(
        column="log_density",
        cmap=cmap,
        linewidth=0.1,
        edgecolor="white",
        legend=True,
        ax=axes[i],
    )
    axes[i].set_title(f'{title} Color Scheme', fontweight='bold')
    axes[i].set_axis_off()

plt.suptitle('Median Household Income - Different Color Schemes\nOrange County, CA', 
             fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
../_images/878e711d8d060e4e6edd285499dc6c60ff5bbc49d4298f3a245829448a3538e3.png

Classification Schemes

Sometimes it’s useful to classify continuous data into discrete categories:

# Create income categories
orange_classified = orange_projected.copy()
orange_classified.dropna(subset=["B19013_001E"], inplace=True)

# Define income brackets
bins = [0, 50000, 75000, 100000, 150000, float('inf')]
labels = ['< $50K', '$50K-$75K', '$75K-$100K', '$100K-$150K', '> $150K']

orange_classified["income_category"] = pd.cut(
    orange_classified["B19013_001E"], bins=bins, labels=labels, include_lowest=True
)

# Create custom color palette
colors = ['#fee5d9', '#fcbba1', '#fc9272', '#fb6a4a', '#de2d26']


from matplotlib.patches import Patch
 
# Create legend handles for each category
legend_handles = [
    Patch(facecolor=color, edgecolor="white", label=label)
    for label, color in zip(labels, colors)
]

fig, ax = plt.subplots(figsize=(12, 10))

for category, color in zip(labels, colors):
    subset = orange_classified[orange_classified["income_category"] == category]
    subset.plot(color=color, linewidth=0.1, edgecolor="white", ax=ax)

ax.set_title(
    "Median Household Income Categories\nOrange County, CA",
    fontsize=14,
    fontweight="bold",
)
ax.set_axis_off()
ax.legend(
    handles=legend_handles,
    title="Income Category",
    loc="upper left",
    bbox_to_anchor=(1, 1),
)

plt.tight_layout()
plt.show()
../_images/c47d3bf3828bdd5f664a1b75aa58da500c31e9074c13ce98b849955bfd225445.png

Spatial Analysis

Let’s perform some basic spatial analysis:

# Calculate centroids and identify high-income clusters
orange_analysis = orange_projected.copy()
orange_analysis['centroid'] = orange_analysis.geometry.centroid

# Identify high-income tracts (top quartile)
high_income_threshold = orange_analysis["B19013_001E"].quantile(0.75)
orange_analysis["high_income"] = orange_analysis["B19013_001E"] >= high_income_threshold

print(f"High income threshold: ${high_income_threshold:,.0f}")
print(f"Number of high-income tracts: {orange_analysis['high_income'].sum()}")
print(f"Percentage of tracts: {100 * orange_analysis['high_income'].mean():.1f}%")
High income threshold: $140,567
Number of high-income tracts: 153
Percentage of tracts: 24.9%
# Visualize high-income areas
fig, ax = plt.subplots(figsize=(12, 10))

# Plot all tracts in light gray
orange_analysis.plot(
    color='lightgray',
    linewidth=0.1,
    edgecolor='white',
    ax=ax
)

# Highlight high-income tracts
high_income_tracts = orange_analysis[orange_analysis['high_income']]
high_income_tracts.plot(
    color='darkred',
    linewidth=0.1,
    edgecolor='white',
    ax=ax,
    alpha=0.8
)

ax.set_title(f'High-Income Census Tracts\nOrange County, CA (Top 25%, >${high_income_threshold:,.0f})', 
             fontsize=14, fontweight='bold')
ax.set_axis_off()

# Add legend
from matplotlib.patches import Patch
legend_elements = [
    Patch(facecolor='lightgray', label='Other tracts'),
    Patch(facecolor='darkred', label=f'High income (>${high_income_threshold:,.0f}+)')
]
ax.legend(handles=legend_elements, loc='upper right')

plt.tight_layout()
plt.show()
../_images/62d8e161335d185315b9172f5cb1c236bb5ef7abd7092cc310b71d98e2587f83.png

Summary Statistics by Geography

Let’s compute some summary statistics for our spatial data:

# Summary statistics for Orange County income data
print("Median Household Income Statistics - Orange County, CA")
print("=" * 55)
print(f"Mean:     ${orange['B19013_001E'].mean():,.0f}")
print(f"Median:   ${orange['B19013_001E'].median():,.0f}")
print(f"Std Dev:  ${orange['B19013_001E'].std():,.0f}")
print(f"Min:      ${orange['B19013_001E'].min():,.0f}")
print(f"Max:      ${orange['B19013_001E'].max():,.0f}")
print(f"Range:    ${orange['B19013_001E'].max() - orange['B19013_001E'].min():,.0f}")

# Quartile analysis
print("\nQuartile Analysis:")
quartiles = orange["B19013_001E"].quantile([0.25, 0.5, 0.75])
for q, val in quartiles.items():
    print(f"{int(q*100)}th percentile: ${val:,.0f}")
Median Household Income Statistics - Orange County, CA
=======================================================
Mean:     $117,036
Median:   $112,780
Std Dev:  $40,207
Min:      $36,441
Max:      $250,001
Range:    $213,560

Quartile Analysis:
25th percentile: $85,860
50th percentile: $112,780
75th percentile: $140,567

Saving Spatial Data

You can save your spatial data in various formats:

# Save to different formats (uncomment to save)
# orange.to_file("orange_county_income.shp")  # Shapefile
# orange.to_file("orange_county_income.geojson", driver="GeoJSON")  # GeoJSON
# orange.to_parquet("orange_county_income.parquet")  # Parquet (preserves geometry)

print("File formats available for saving:")
print("- Shapefile (.shp)")
print("- GeoJSON (.geojson)")
print("- Parquet (.parquet)")
print("- PostGIS database")
print("- And many more via GeoPandas!")
File formats available for saving:
- Shapefile (.shp)
- GeoJSON (.geojson)
- Parquet (.parquet)
- PostGIS database
- And many more via GeoPandas!

Summary

In this notebook, we learned how to:

  1. Get spatial data by setting geometry=True in pytidycensus functions

  2. Create choropleth maps using GeoPandas and matplotlib

  3. Work with coordinate systems and projections

  4. Create faceted maps for comparing multiple variables

  5. Use different visualization techniques including color schemes and classification

  6. Perform basic spatial analysis like identifying spatial patterns

  7. Save spatial data in various formats

Next Steps

  • Advanced Spatial Analysis: Use spatial statistics packages like PySAL

  • Interactive Maps: Create web maps with folium or plotly

  • 3D Visualization: Use packages like pydeck for 3D mapping

  • Web Mapping: Deploy maps with streamlit or dash

Resources