Getting Started
This guide will help you get started with pytidycensus, a Python library for accessing US Census data.
Installation
Install pytidycensus using pip:
pip install pytidycensus
For development installation:
git clone https://github.com/walkerke/tidycensus
cd tidycensus/pytidycensus
pip install -e .
Census API Key
To use pytidycensus, you need a free API key from the US Census Bureau:
Visit https://api.census.gov/data/key_signup.html
Fill out the form to request an API key
Check your email for the API key
Once you have your key, set it in Python:
import pytidycensus as tc
tc.set_census_api_key("your_api_key_here")
Alternatively, you can set it as an environment variable:
export CENSUS_API_KEY="your_api_key_here"
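If you go the environment-variable route, it can help to fail early with a clear message when the variable is missing. The sketch below is one way to do that using only the standard library; `load_census_key` is a hypothetical helper, not part of pytidycensus.

```python
import os


def load_census_key(env_var="CENSUS_API_KEY"):
    """Return the API key stored in the environment, or raise a clear error.

    Hypothetical helper for illustration; pytidycensus does not ship it.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; export it before running.")
    return key
```

The returned string could then be passed to tc.set_census_api_key() as shown above.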
Basic Usage
Getting ACS Data
The American Community Survey (ACS) is the most commonly used Census dataset:
import pytidycensus as tc
# Get median household income by state
income_data = tc.get_acs(
    geography="state",
    variables="B19013_001",
    year=2022
)
print(income_data.head())
Adding Geography
To include geographic boundaries for mapping:
# Get data with geometry
income_geo = tc.get_acs(
    geography="state",
    variables="B19013_001",
    year=2022,
    geometry=True
)
# Now you can map it
income_geo.plot(column='value', legend=True)
Multiple Variables
You can request multiple variables at once:
# Get population and median income
demo_data = tc.get_acs(
    geography="county",
    variables=["B01003_001", "B19013_001"],  # Population, Median Income
    state="CA",
    year=2022
)
Searching for Variables
Find variables by searching their descriptions:
# Search for income-related variables
income_vars = tc.search_variables("income", 2022, "acs", "acs5")
print(income_vars[['name', 'label']].head(10))
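To make the idea of searching descriptions concrete, here is a minimal sketch of a case-insensitive match over a small in-memory catalog. It does not reproduce how search_variables works internally; `find_variables` is a hypothetical helper, and the labels are the shortened ones used in the comments of this guide, not official Census labels.

```python
# Tiny stand-in catalog; labels are shortened placeholders, not official text.
catalog = [
    {"name": "B01003_001", "label": "Population"},
    {"name": "B19013_001", "label": "Median Income"},
]


def find_variables(records, pattern):
    """Return records whose label contains the pattern, ignoring case."""
    needle = pattern.lower()
    return [r for r in records if needle in r["label"].lower()]


matches = find_variables(catalog, "income")
```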
Data Formats
Tidy Format (Default)
By default, data is returned in “tidy” format where each row represents one geography-variable combination:
data = tc.get_acs(
    geography="state",
    variables=["B01003_001", "B19013_001"],
    output="tidy"  # This is the default
)
# Result: One row per state-variable combination
Wide Format
You can also get data in “wide” format where each row represents one geography:
data = tc.get_acs(
    geography="state",
    variables=["B01003_001", "B19013_001"],
    output="wide"
)
# Result: One row per state, variables as columns
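The relationship between the two layouts can be sketched without any Census call. The example below pivots a few tidy rows into wide rows using plain dictionaries. The `tidy_to_wide` helper is hypothetical and the values are made-up placeholders, not Census figures.

```python
from collections import defaultdict

# Tidy layout: one row per geography-variable pair.
# Values are made-up placeholders, not real Census data.
tidy_rows = [
    {"GEOID": "01", "variable": "B01003_001", "value": 100},
    {"GEOID": "01", "variable": "B19013_001", "value": 200},
    {"GEOID": "02", "variable": "B01003_001", "value": 300},
    {"GEOID": "02", "variable": "B19013_001", "value": 400},
]


def tidy_to_wide(rows):
    """Pivot tidy rows into one dict per GEOID, with variables as keys."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row["GEOID"]][row["variable"]] = row["value"]
    return [{"GEOID": geoid, **values} for geoid, values in wide.items()]


wide_rows = tidy_to_wide(tidy_rows)  # two rows, one per GEOID
```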
Geographic Levels
pytidycensus supports many geographic levels:
"us" - United States
"region" - Census regions
"division" - Census divisions
"state" - States
"county" - Counties
"tract" - Census tracts
"block group" - Block groups
"place" - Places/cities
"zcta" - ZIP Code Tabulation Areas
Geographic Filtering
Filter data to specific geographies:
# County data for Texas only
tx_counties = tc.get_acs(
    geography="county",
    variables="B01003_001",
    state="TX"
)
# Tract data for Harris County, Texas
harris_tracts = tc.get_acs(
    geography="tract",
    variables="B01003_001",
    state="TX",
    county="201"  # Harris County FIPS code
)
pytidycensus also includes a county name lookup, so you can pass the county name instead:
county="Harris County" # instead of FIPS code
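The state and county FIPS codes used for filtering are also the leading digits of tract identifiers: a tract GEOID is 11 digits, made up of a 2-digit state code, a 3-digit county code, and a 6-digit tract code. The sketch below splits one apart. `split_tract_geoid` is a hypothetical helper, and the tract portion of the sample GEOID is a placeholder rather than a real tract.

```python
def split_tract_geoid(geoid):
    """Split an 11-digit tract GEOID into state, county, and tract codes."""
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected 11 digits, got {geoid!r}")
    return {"state": geoid[:2], "county": geoid[2:5], "tract": geoid[5:]}


# "48" is Texas and "201" is Harris County; "999999" is a placeholder tract.
parts = split_tract_geoid("48201999999")
```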
Survey Types
The ACS has different survey periods:
"acs5" - 5-year estimates (default, more reliable for small areas)
"acs1" - 1-year estimates (more current, less reliable for small areas)
# Get 1-year ACS data
current_data = tc.get_acs(
    geography="state",
    variables="B01003_001",
    survey="acs1",
    year=2022
)
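One practical consequence of the trade-off above: the Census Bureau publishes 1-year estimates only for areas with a population of 65,000 or more, so smaller areas need the 5-year product. A minimal sketch of that decision follows; `pick_survey` is a hypothetical helper, not a pytidycensus function.

```python
# 1-year ACS estimates are published only for areas at or above this size.
ACS1_MIN_POPULATION = 65_000


def pick_survey(population):
    """Return "acs1" when the area is large enough, otherwise "acs5"."""
    return "acs1" if population >= ACS1_MIN_POPULATION else "acs5"
```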
Margin of Error
ACS estimates carry margins of error, which pytidycensus includes in the result automatically:
data = tc.get_acs(
    geography="state",
    variables="B19013_001"
)
# The result includes both estimate and margin of error
print(data.columns)
# ['GEOID', 'NAME', 'variable', 'value', 'B19013_001_moe']
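Once you have an estimate and its margin of error, a few standard calculations become possible. ACS margins of error are published at the 90% confidence level, so dividing by 1.645 gives an approximate standard error, and the margin of a sum of estimates is commonly approximated by the root sum of squares. The helper names below are hypothetical and are not claimed to be part of pytidycensus.

```python
import math

Z_90 = 1.645  # z-score for the 90% confidence level used by ACS margins


def moe_to_interval(estimate, moe):
    """Return the (lower, upper) bounds of the 90% confidence interval."""
    return estimate - moe, estimate + moe


def moe_to_standard_error(moe):
    """Convert a 90% margin of error to an approximate standard error."""
    return moe / Z_90


def moe_sum(moes):
    """Approximate the margin of error for a sum of estimates."""
    return math.sqrt(sum(m * m for m in moes))
```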
Population Estimates Program
The Population Estimates Program provides annual population estimates and demographic characteristics. For years 2020 and later, pytidycensus retrieves data from CSV files; for earlier years (2015-2019), it uses the Census API.
Basic Population Estimates
# Get total population by state for 2022
state_pop = tc.get_estimates(
    geography="state",
    variables="POP",
    vintage=2022
)
Components of Population Change
# Get births, deaths, and migration data
components = tc.get_estimates(
    geography="state",
    variables=["BIRTHS", "DEATHS", "DOMESTICMIG", "INTERNATIONALMIG"],
    vintage=2022
)
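These four components combine in a simple way: natural change is births minus deaths, and net migration is domestic plus international migration. The sketch below shows the arithmetic with placeholder numbers. `population_change` is a hypothetical helper, and it ignores any residual term the published estimates may include, so its total is not guaranteed to match an official change figure.

```python
def population_change(births, deaths, domestic_mig, international_mig):
    """Combine components of change; ignores any published residual."""
    natural_change = births - deaths
    net_migration = domestic_mig + international_mig
    return {
        "natural_change": natural_change,
        "net_migration": net_migration,
        "total_change": natural_change + net_migration,
    }


# Placeholder inputs for illustration only, not Census figures.
result = population_change(120, 100, -30, 15)
```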
Demographic Breakdowns
Use the breakdown parameter to get population estimates by demographics:
# Population by sex and race
demographics = tc.get_estimates(
    geography="state",
    variables="POP",
    breakdown=["SEX", "RACE"],
    breakdown_labels=True,  # Include human-readable labels
    year=2022
)
Geographic Levels
Population estimates support multiple geographies:
# County-level data for Texas
tx_counties = tc.get_estimates(
    geography="county",
    variables="POP",
    state="TX",
    year=2022
)
# Metro areas (CBSAs)
metros = tc.get_estimates(
    geography="cbsa",
    variables="POP",
    year=2022
)
Time Series Data
Get population estimates across multiple years:
# Time series for states from 2020-2023
time_series = tc.get_estimates(
    geography="state",
    variables="POP",
    time_series=True,
    vintage=2023
)
Data Products
Use the product parameter to specify the type of data:
# Basic population totals (default)
population = tc.get_estimates(
    geography="state",
    product="population",  # or omit for default
    variables="POP",
    year=2022
)
# Components of population change
components = tc.get_estimates(
    geography="state",
    product="components",
    variables=["BIRTHS", "DEATHS"],
    year=2022
)
# Population characteristics by demographics
characteristics = tc.get_estimates(
    geography="state",
    product="characteristics",
    variables="POP",
    breakdown=["SEX"],
    year=2022
)
Advanced Time Series Analysis
pytidycensus provides time series functions for analyzing demographic change over time, with automatic handling of changing geographic boundaries.
Installation for Time Series
# Install with time series support (includes tobler for area interpolation)
pip install pytidycensus[time]
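Before running an interpolation workflow, you may want to confirm that the optional dependency is actually importable. The check below uses only the standard library. `is_installed` is a hypothetical helper, and the module name "tobler" is taken from the install comment above.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


has_tobler = is_installed("tobler")  # False means the extra is not installed
```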
Basic Time Series Analysis
The get_time_series() function automatically handles boundary changes and variable differences across years:
# ACS time series with automatic boundary handling
data = tc.get_time_series(
    geography="tract",
    variables={"total_pop": "B01003_001E", "median_income": "B19013_001E"},
    years=[2015, 2020],
    dataset="acs5",
    state="DC",
    base_year=2020,  # Use 2020 boundaries as reference
    extensive_variables=["total_pop"],  # Counts/totals
    intensive_variables=["median_income"],  # Rates/medians
    geometry=True,
    output="wide"
)
Decennial Census Time Series
Handle different variable codes across decennial years:
# Different variable codes for different years
variables = {
    2010: {"total_pop": "P001001"},  # 2010 uses P001001
    2020: {"total_pop": "P1_001N"}  # 2020 uses P1_001N
}
data = tc.get_time_series(
    geography="tract",
    variables=variables,
    years=[2010, 2020],
    dataset="decennial",
    state="DC",
    base_year=2020,
    extensive_variables=["total_pop"]
)
Time Period Comparisons
Use compare_time_periods() for detailed change analysis:
# Systematic comparison between time periods
comparison = tc.compare_time_periods(
    data=data,
    base_period=2015,
    comparison_period=2020,
    variables=["total_pop", "median_income"],
    calculate_change=True,
    calculate_percent_change=True
)
# Results include columns like:
# total_pop_2015, total_pop_2020, total_pop_change, total_pop_pct_change
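The change columns follow ordinary arithmetic, sketched below for a single pair of values. This illustrates the calculation only and is not the library's implementation: `compare_values` is a hypothetical helper, and treating percent change as a 0-100 figure (rather than a fraction) is an assumption.

```python
def compare_values(base, comparison):
    """Return (absolute change, percent change) between two periods.

    Percent change is None when the base value is zero, since the ratio
    is undefined in that case.
    """
    change = comparison - base
    pct_change = change / base * 100 if base else None
    return change, pct_change
```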
Key Features
Automatic Area Interpolation: Handles changing tract boundaries using tobler
Variable Classification: Distinguishes between extensive (counts) and intensive (rates) variables
Built-in Validation: Checks interpolation accuracy and data conservation
Flexible Output: Wide format (multi-index columns) or tidy format (long form)
Multiple Datasets: Support for both ACS and Decennial Census
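The extensive/intensive distinction matters because the two kinds of variable are interpolated differently. The simplified sketch below shows plain area weighting: counts are apportioned by the share of each source unit that falls inside the target, while rates or medians are averaged with the overlap areas as weights. It is not tobler's implementation, the function names are hypothetical, and area-weighting a median is itself only an approximation.

```python
def interpolate_extensive(source_values, overlap_areas, source_areas):
    """Apportion counts by the fraction of each source unit that overlaps."""
    return sum(
        value * overlap / total
        for value, overlap, total in zip(source_values, overlap_areas, source_areas)
    )


def interpolate_intensive(source_values, overlap_areas):
    """Average rates or medians, weighted by overlap area."""
    weighted = sum(v * a for v, a in zip(source_values, overlap_areas))
    return weighted / sum(overlap_areas)


# A target tract covers all of source A (area 2 of 2) and half of
# source B (area 1 of 2). Values are placeholders, not Census data.
target_pop = interpolate_extensive([100, 200], [2, 1], [2, 2])
target_income = interpolate_intensive([50, 80], [2, 1])
```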
Geographic Boundary Handling
Stable Geographies (state, county): No interpolation needed
Changing Geographies (tract, block group): Automatic area interpolation
Base Year Selection: Choose which year’s boundaries to use as reference
Example: County-Level Analysis (Stable Boundaries)
# For stable geographies, interpolation is automatically skipped
county_data = tc.get_time_series(
    geography="county",
    variables={"total_pop": "B01003_001E", "median_income": "B19013_001E"},
    years=[2018, 2022],
    dataset="acs5",
    state="CA",
    geometry=False  # Faster for summary statistics
)
comparison = tc.compare_time_periods(
    data=county_data,
    base_period=2018,
    comparison_period=2022,
    variables=["total_pop", "median_income"]
)
For detailed examples and advanced techniques, see the Time Series Analysis Tutorial.
Next Steps
Explore comprehensive Jupyter notebook examples
Check the API reference for detailed function documentation
Visit the GitHub repository for the latest updates
Come study with us at The George Washington University
