Census LLM Assistant
The Census LLM Assistant provides a conversational interface to US Census data using Large Language Models. Instead of learning Census variable codes and geography hierarchies, you can describe your research needs in natural language.
The assistant is specifically trained on pytidycensus functionality and will only recommend pytidycensus functions, never other Census libraries. It includes comprehensive knowledge of common variable codes, geographic levels, and best practices.
Key Features
- Conversational Interface: Ask questions in natural language 
- Smart Variable Selection: Automatically finds appropriate Census variables 
- Selective Normalization: Only suggests denominators when needed (not for medians, rates, etc.) 
- Variable Name Cleaning: Removes confusing ‘E’ suffixes (B19013_001E → B19013_001) 
- Geographic Guidance: Helps choose the right geographic level for your research 
- Code Generation: Produces ready-to-run pytidycensus code 
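The variable-name cleaning feature follows a simple rule: ACS estimate codes end in `E`, and the assistant strips that suffix from table-style column names. A minimal sketch of that rule (the helper name `clean_variable_names` is illustrative, not part of the pytidycensus API):

```python
def clean_variable_names(columns):
    """Strip the trailing 'E' from ACS estimate codes like B19013_001E.

    Only columns that look like Census table codes (start with 'B',
    contain an underscore, end in 'E') are renamed; everything else
    (GEOID, NAME, ...) is left untouched.
    """
    return {
        col: col[:-1]
        for col in columns
        if col.endswith("E") and "_" in col and col.split("_")[0].startswith("B")
    }


mapping = clean_variable_names(["GEOID", "NAME", "B19013_001E", "B19001_001E"])
print(mapping)  # {'B19013_001E': 'B19013_001', 'B19001_001E': 'B19001_001'}
```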
Quick Start
Command Line Interface (Recommended)
After installation, run the CLI with your API keys from your terminal:
python -m pytidycensus.llm_interface.cli --census-key YOUR_CENSUS_KEY --openai-key YOUR_OPENAI_KEY
Then type your request, e.g.:
📊 You: Get median income by county in Texas
🤔 Thinking...
🏛️  Assistant: To analyze the median income by county in Texas, we will use the `get_acs()` function from pytidycensus to retrieve the necessary data. We will specifically request the median household income variable (`B19013_001E`) along with the total number of households (`B19001_001E`) to ensure proper normalization.
Here's the code example to get the median income by county in Texas for the most recent available year:
import pytidycensus as tc
# Get the median household income and total households for Texas counties
income_data = tc.get_acs(
    geography="county",
    variables=[
        "B19013_001E",  # Median household income
        "B19001_001E",  # Total households (normalization variable)
    ],
    state="TX",  # Specify the state as Texas
    year=2020,  # Choose the desired year (most recent available)
    output="wide"  # Spread variables across columns for better readability
)
# Calculate any derived metrics if needed, such as income per household
income_data['income_per_household'] = income_data['B19013_001E'] / income_data['B19001_001E']
# Display the resulting DataFrame with median income and additional metrics
print(income_data)
In this code:
- We request the median household income (`B19013_001E`) and total households (`B19001_001E`) for all counties in Texas in the year 2020.
- The `output="wide"` parameter spreads the variables across columns in the resulting DataFrame.
- We calculate the income per household by dividing the median household income by the total number of households. 
This code will provide you with a DataFrame containing the median income by county in Texas along with any additional derived metrics you choose to calculate. Feel free to adjust the year or add more variables as needed for further analysis.
In Jupyter Notebook
from pytidycensus.llm_interface import CensusAssistant
# Initialize assistant
assistant = CensusAssistant(
    census_api_key="your_census_api_key",
    openai_api_key="your_openai_key"  # Optional
)
# Ask for data (use await directly in Jupyter)
response = await assistant.chat("Get median income by county in Texas")
print(response)
What you’ll see:
🏛️ Assistant: I'll help you get median income data for Texas counties.
I found the right variable:
• B19013_001E: Median household income
Here's your pytidycensus code:
import pytidycensus as tc
data = tc.get_acs(
    geography="county",
    variables=["B19013_001E"],
    state="TX",
    year=2020,
    output="wide",
    api_key=census_api_key
)
Your data is ready! The column B19013_001 contains median income for each Texas county.
The assistant automatically:
- ✅ Finds the right Census variable (`B19013_001E`)
- ✅ Uses wide format output for easy analysis
- ✅ Cleans column names (`B19013_001E` → `B19013_001`)
- ✅ Generates ready-to-run code
Try It Right Now (No Setup Required)
If you just want to see how it works without any API keys:
# Test the conversation examples
python tests/test_conversation_to_query.py verbose
# Or see the normalization intelligence demo
python examples/llm_assistant_example.py
These will show you exactly what conversations look like and which variables get normalization suggestions.
Installation
Basic Installation
pip install pytidycensus[LLM]
LLM Dependencies
To use the LLM assistant, you need either an OpenAI API key or a local model such as Ollama. You can sign up for an OpenAI API key at https://platform.openai.com/.
The LLM cannot work without one of these configured:
Option A: OpenAI (Recommended for reliability)
pip install openai
export OPENAI_API_KEY="your_key_here"
Option B: Local Models (Free, requires setup)
pip install ollama
# Install Ollama from https://ollama.ai/
ollama pull llama3.2
ollama serve
Census API Key
Get a free Census API key and set it as an environment variable from your terminal:
# Get key at: https://api.census.gov/data/key_signup.html
export CENSUS_API_KEY="your_census_key_here"
Or use Python to set it directly:
import os
os.environ["CENSUS_API_KEY"] = "your_census_key_here"
Command Line Interface
For interactive exploration:
# Start interactive session
python -m pytidycensus.llm_interface.cli
# With API keys if not set in environment
python -m pytidycensus.llm_interface.cli --census-key YOUR_KEY --openai-key YOUR_KEY
Commands in CLI
- help: Show help information
- reset: Start new conversation
- state: Show current conversation state
- export: Save conversation to JSON
- quit: Exit
Configuration Options
LLM Provider Priority
The assistant tries providers in order:
- OpenAI GPT-3.5 Turbo (reliable, ~$0.01 per conversation) 
- Local Ollama (free, requires local setup) 
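The fallback behavior can be pictured as trying each provider in priority order until one succeeds. A simplified sketch (the functions and `ProviderError` class here are stand-ins for illustration, not the real `LLMManager` internals):

```python
class ProviderError(Exception):
    """Raised by a provider that is unavailable or misconfigured."""


def complete_with_fallback(providers, prompt):
    """Try each provider in priority order; return the first success."""
    errors = []
    for provider in providers:
        try:
            return provider(prompt)
        except ProviderError as exc:
            errors.append(str(exc))
    raise RuntimeError("No LLM providers available: " + "; ".join(errors))


def openai_stub(prompt):
    # Simulates a provider that fails (e.g., missing API key)
    raise ProviderError("OpenAI key not set")


def ollama_stub(prompt):
    # Simulates a working local model
    return f"local answer to: {prompt}"


print(complete_with_fallback([openai_stub, ollama_stub], "hello"))
# local answer to: hello
```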
Custom Configuration
from pytidycensus.llm_interface import CensusAssistant, LLMManager, OpenAIProvider, OllamaProvider
# Create custom provider setup
providers = [
    OpenAIProvider(model="gpt-4", api_key="your_key"),
    OllamaProvider(model="mixtral:8x7b")  # Larger local model
]
llm_manager = LLMManager(providers)
assistant = CensusAssistant(llm_manager=llm_manager)
Real Example Conversations
These examples are taken directly from our test suite and demonstrate actual conversation flows:
Example 1: Wisconsin County Income Analysis
👤 User: I'm studying household income variations in Wisconsin
🏛️ Assistant: Great! I can help with income data. What geographic level do you need?
👤 User: I want county-level data
🏛️ Assistant: Perfect. County-level analysis provides good detail. What state are you interested in?
👤 User: Use the most recent data available
🏛️ Assistant: 2020 ACS data is a good choice. Should I use the 5-year estimates for better coverage?
👤 User: Generate the pytidycensus code
🏛️ Assistant: I'll generate the pytidycensus code for you now.
📋 Final State: geography=county, variables=['B19013_001'], state=WI, year=2020, dataset=acs5
Generated Code:
import pytidycensus as tc
# Set your Census API key
# Get one at: https://api.census.gov/data/key_signup.html
census_api_key = "YOUR_API_KEY_HERE"
# Get Census data (wide format with cleaned variable names)
data = tc.get_acs(
    geography="county",
    variables=["B19013_001E"],
    state="WI",
    year=2020,
    output="wide",
    api_key=census_api_key
)
# Clean variable names by removing 'E' suffix
column_mapping = {col: col[:-1] for col in data.columns
                  if col.endswith('E') and '_' in col and col.split('_')[0].startswith('B')}
if column_mapping:
    data = data.rename(columns=column_mapping)
    print(f"Cleaned {len(column_mapping)} variable names by removing 'E' suffix")
print(f"Retrieved {data.shape[0]} rows and {data.shape[1]} columns")
print(data.head())
Result: DataFrame with Wisconsin county data where B19013_001E becomes B19013_001
Example 2: DC Inequality Analysis with Normalization
👤 User: I want to study inequality in Washington DC
🏛️ Assistant: I understand. What additional details do you need for this analysis?
👤 User: Get poverty and income data by Census tract
🏛️ Assistant: Census tracts are ideal for neighborhood-level analysis. Which state and county?
👤 User: Include both counts and totals for calculating rates
🏛️ Assistant: I understand. What additional details do you need for this analysis?
👤 User: Use 2020 ACS 5-year data
🏛️ Assistant: 2020 ACS data is a good choice. Should I use the 5-year estimates for better coverage?
👤 User: Yes, generate the code
🏛️ Assistant: I'll generate the pytidycensus code for you now.
📋 Final State: geography=tract, variables=['B17001_002E', 'B17001_001E', 'B19013_001E'], state=DC, year=2020
Generated Code:
import pytidycensus as tc
# Set your Census API key
census_api_key = "YOUR_API_KEY_HERE"
# Get Census data (wide format with cleaned variable names)
data = tc.get_acs(
    geography="tract",
    variables=["B17001_002E", "B17001_001E", "B19013_001E"],
    state="DC",
    year=2020,
    output="wide",
    api_key=census_api_key
)
# Clean variable names by removing 'E' suffix
column_mapping = {col: col[:-1] for col in data.columns
                  if col.endswith('E') and '_' in col and col.split('_')[0].startswith('B')}
if column_mapping:
    data = data.rename(columns=column_mapping)
print(f"Retrieved {data.shape[0]} rows and {data.shape[1]} columns")
print(data.head())
# Calculate poverty rate
data['poverty_rate'] = (data['B17001_002'] / data['B17001_001']) * 100
Result:
- `B17001_002E` (poverty count) → `B17001_002`
- `B17001_001E` (total for poverty) → `B17001_001`
- `B19013_001E` (median income) → `B19013_001`
Notice how the assistant automatically included the normalization variable (B17001_001E) for calculating poverty rates, but did NOT suggest a denominator for median income since it’s already a rate.
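The rate calculation itself is plain pandas arithmetic. On a small synthetic DataFrame (the GEOIDs and counts below are made up for illustration), the pattern looks like:

```python
import pandas as pd

# Synthetic tract-level data with cleaned column names (values are made up)
data = pd.DataFrame({
    "GEOID": ["11001000100", "11001000200"],
    "B17001_002": [120, 340],    # people below poverty level
    "B17001_001": [1500, 2000],  # population for whom poverty is determined
})

# Rate = count / total, scaled to a percentage
data["poverty_rate"] = (data["B17001_002"] / data["B17001_001"]) * 100
print(data[["GEOID", "poverty_rate"]])
```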
Example 3: Spatial Analysis with Geometry
👤 User: I need Wisconsin county income data for mapping
🏛️ Assistant: I understand. What additional details do you need for this analysis?
👤 User: Include geographic boundaries
🏛️ Assistant: I understand. What additional details do you need for this analysis?
👤 User: 2020 ACS data
🏛️ Assistant: 2020 ACS data is a good choice. Should I use the 5-year estimates for better coverage?
👤 User: Generate the mapping-ready code
🏛️ Assistant: I'll generate the pytidycensus code for you now.
📋 Final State: geography=county, variables=['B19013_001'], state=WI, year=2020, geometry=True
Generated Code:
import pytidycensus as tc
# Get Census data (wide format with cleaned variable names)
data = tc.get_acs(
    geography="county",
    variables=["B19013_001E"],
    state="WI",
    year=2020,
    output="wide",
    geometry=True,
    api_key=census_api_key
)
print(f"Retrieved {data.shape[0]} rows and {data.shape[1]} columns")
print(data.head())
# Ready for mapping with GeoPandas
data.plot(column='B19013_001', legend=True)
Result: GeoPandas GeoDataFrame ready for mapping with clean column name B19013_001
Example 4: Direct Population Query
👤 User: Get me total population by state for 2020 decennial census
🏛️ Assistant: ✅ Success! I retrieved your Census data:
**Results**: 51 rows × 4 columns
**Data preview:**
  GEOID        NAME    P1_001
0    01     Alabama   5024279
1    02      Alaska    733391
2    04     Arizona   7151502
3    05    Arkansas   3011524
4    06  California  39538223
**Generated code:**
import pytidycensus as tc
# Get Census data (wide format with cleaned variable names)
data = tc.get_decennial(
    geography="state",
    variables=["P1_001N"],
    year=2020,
    output="wide",
    api_key=census_api_key
)
# Clean variable names by removing 'E' suffix
# (Note: Decennial variables use 'N' suffix, so no cleaning needed)
print(f"Retrieved {data.shape[0]} rows and {data.shape[1]} columns")
print(data.head())
Selective Normalization in Action
The assistant now intelligently determines when normalization variables are needed:
✅ Variables that GET normalization suggestions:
- `B08301_021E` (Workers who walked) → suggests `B08301_001E` (Total workers)
- `B25003_002E` (Owner-occupied units) → suggests `B25003_001E` (Total occupied units)
- `B19001_017E` (Households with income $200k+) → suggests `B19001_001E` (Total households)
❌ Variables that do NOT get normalization suggestions:
- `B19013_001E` (Median household income) - Already a median
- `B25064_001E` (Median gross rent) - Already a median
- `B08006_008E` (Mean travel time) - Already a mean
- `B25119_001E` (Housing cost as percentage) - Already a rate
- Any variable ending in `_001E` - These ARE the totals
This prevents the confusion of suggesting denominators for variables that don’t need them.
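One way to implement this selectivity is a small heuristic over the variable code and its label. The sketch below (the function name and rules are illustrative, not the assistant's actual logic) reproduces the behavior described above:

```python
def needs_denominator(code, label):
    """Heuristic: does this ACS variable need a normalization total?

    Medians, means, and percentages are already normalized, and
    *_001 variables are themselves the totals.
    """
    label_lower = label.lower()
    if any(word in label_lower for word in ("median", "mean", "percent")):
        return False
    if code.rstrip("E").endswith("_001"):
        return False  # this IS the total
    return True


print(needs_denominator("B08301_021E", "Workers who walked"))       # True
print(needs_denominator("B19013_001E", "Median household income"))  # False
print(needs_denominator("B19001_001E", "Total households"))         # False
```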
Testing and Development
Run Test Conversations
See actual conversation flows and generated code:
# Run verbose conversation tests
python tests/test_conversation_to_query.py verbose
# Run specific test
python tests/test_conversation_to_query.py verbose wisconsin
python tests/test_conversation_to_query.py verbose dc
Troubleshooting
Common Issues
No LLM providers available:
pip install openai  # For OpenAI
# OR
pip install ollama && ollama serve  # For local models
Census API key issues:
export CENSUS_API_KEY="your_key_here"
# Get free key: https://api.census.gov/data/key_signup.html
Variable not found errors:
- The assistant uses knowledge base + search to find variables 
- Some very specific variables may need manual specification 
- Use the `search_variables()` function for exploration
Contributing
The LLM interface is designed to be extensible:
- Add new variable mappings in `knowledge_base.py`
- Extend geography guidance with local expertise 
- Contribute test cases for new conversation patterns 
- Help improve normalization logic for edge cases 
The goal is making Census data accessible through natural conversation!