Census LLM Assistant
The Census LLM Assistant provides a conversational interface to US Census data using Large Language Models. Instead of learning Census variable codes and geography hierarchies, you can describe your research needs in natural language.
The assistant is specifically trained on pytidycensus functionality and will only recommend pytidycensus functions, never other Census libraries. It includes comprehensive knowledge of common variable codes, geographic levels, and best practices.
Key Features
- Conversational Interface: Ask questions in natural language 
- Smart Variable Selection: Automatically finds appropriate Census variables 
- Selective Normalization: Only suggests denominators when needed (not for medians, rates, etc.) 
- Variable Name Cleaning: Removes confusing ‘E’ suffixes (B19013_001E → B19013_001) 
- Geographic Guidance: Helps choose the right geographic level for your research 
- Code Generation: Produces ready-to-run pytidycensus code 
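The variable-name cleaning feature follows a simple rule: ACS estimate codes end in `E`, and the assistant strips that suffix from table-style column names. A minimal sketch of that rule (the helper name `clean_variable_names` is illustrative, not part of the pytidycensus API):

```python
def clean_variable_names(columns):
    """Strip the trailing 'E' from ACS estimate codes like B19013_001E.

    Only columns that look like Census table codes (start with 'B',
    contain an underscore, end in 'E') are renamed; everything else
    (GEOID, NAME, ...) is left untouched.
    """
    return {
        col: col[:-1]
        for col in columns
        if col.endswith("E") and "_" in col and col.split("_")[0].startswith("B")
    }


mapping = clean_variable_names(["GEOID", "NAME", "B19013_001E", "B19001_001E"])
print(mapping)  # {'B19013_001E': 'B19013_001', 'B19001_001E': 'B19001_001'}
```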
Quick Start
Command Line Interface (Recommended)
After installation, run the CLI with your API keys from your terminal:
python -m pytidycensus.llm_interface.cli --census-key YOUR_CENSUS_KEY --openai-key YOUR_OPENAI_KEY
Then type your request, e.g.:
📊 You: Get median income by county in Texas
🤔 Thinking...
🏛️  Assistant: To analyze the median income by county in Texas, we will use the `get_acs()` function from pytidycensus to retrieve the necessary data. We will specifically request the median household income variable (`B19013_001E`) along with the total number of households (`B19001_001E`) to ensure proper normalization.
Here's the code example to get the median income by county in Texas for the most recent available year:
import pytidycensus as tc
# Get the median household income and total households for Texas counties
income_data = tc.get_acs(
    geography="county",
    variables=[
        "B19013_001E",  # Median household income
        "B19001_001E",  # Total households (normalization variable)
    ],
    state="TX",  # Specify the state as Texas
    year=2020,  # Choose the desired year (most recent available)
    output="wide"  # Spread variables across columns for better readability
)
# Calculate any derived metrics if needed, such as income per household
income_data['income_per_household'] = income_data['B19013_001E'] / income_data['B19001_001E']
# Display the resulting DataFrame with median income and additional metrics
print(income_data)
In this code:
- We request the median household income (`B19013_001E`) and total households (`B19001_001E`) for all counties in Texas in the year 2020.
- The `output="wide"` parameter spreads the variables across columns in the resulting DataFrame.
- We calculate the income per household by dividing the median household income by the total number of households. 
This code will provide you with a DataFrame containing the median income by county in Texas along with any additional derived metrics you choose to calculate. Feel free to adjust the year or add more variables as needed for further analysis.
In Jupyter Notebook
from pytidycensus.llm_interface import CensusAssistant
# Initialize assistant
assistant = CensusAssistant(
    census_api_key="your_census_api_key",
    openai_api_key="your_openai_key"  # Optional
)
# Ask for data (use await directly in Jupyter)
response = await assistant.chat("Get median income by county in Texas")
print(response)
What you’ll see:
🏛️ Assistant: I'll help you get median income data for Texas counties.
I found the right variable:
• B19013_001E: Median household income
Here's your pytidycensus code:
import pytidycensus as tc
data = tc.get_acs(
    geography="county",
    variables=["B19013_001E"],
    state="TX",
    year=2020,
    output="wide",
    api_key=census_api_key
)
Your data is ready! The column B19013_001 contains median income for each Texas county.
The assistant automatically:
- ✅ Finds the right Census variable (`B19013_001E`)
- ✅ Uses wide format output for easy analysis
- ✅ Cleans column names (`B19013_001E` → `B19013_001`)
- ✅ Generates ready-to-run code
Try It Right Now (No Setup Required)
If you just want to see how it works without any API keys:
# Test the conversation examples
python tests/test_conversation_to_query.py verbose
# Or see the normalization intelligence demo
python examples/llm_assistant_example.py
These will show you exactly what conversations look like and which variables get normalization suggestions.
Installation
Basic Installation
pip install pytidycensus[LLM]
LLM Dependencies
To use the LLM assistant, you need either an OpenAI API key or a local model such as Ollama. You can sign up for an OpenAI API key at https://platform.openai.com/.
The LLM cannot work without one of these configured:
Option A: OpenAI (Recommended for reliability)
pip install openai
export OPENAI_API_KEY="your_key_here"
Option B: Local Models (Free, requires setup)
pip install ollama
# Install Ollama from https://ollama.ai/
ollama pull llama3.2
ollama serve
Census API Key
Get a free Census API key and set it as an environment variable from your terminal:
# Get key at: https://api.census.gov/data/key_signup.html
export CENSUS_API_KEY="your_census_key_here"
Or use Python to set it directly:
import os
os.environ["CENSUS_API_KEY"] = "your_census_key_here"
Command Line Interface
For interactive exploration:
# Start interactive session
python -m pytidycensus.llm_interface.cli
# With API keys if not set in environment
python -m pytidycensus.llm_interface.cli --census-key YOUR_KEY --openai-key YOUR_KEY
Commands in CLI
- help: Show help information
- reset: Start new conversation
- state: Show current conversation state
- export: Save conversation to JSON
- quit: Exit
Configuration Options
LLM Provider Priority
The assistant tries providers in order:
- OpenAI GPT-3.5 Turbo (reliable, ~$0.01 per conversation) 
- Local Ollama (free, requires local setup) 
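The fallback behavior can be pictured as trying each provider in priority order until one succeeds. A simplified sketch (the functions and `ProviderError` class here are stand-ins for illustration, not the real `LLMManager` internals):

```python
class ProviderError(Exception):
    """Raised by a provider that is unavailable or misconfigured."""


def complete_with_fallback(providers, prompt):
    """Try each provider in priority order; return the first success."""
    errors = []
    for provider in providers:
        try:
            return provider(prompt)
        except ProviderError as exc:
            errors.append(str(exc))
    raise RuntimeError("No LLM providers available: " + "; ".join(errors))


def openai_stub(prompt):
    # Simulates a provider that fails (e.g., missing API key)
    raise ProviderError("OpenAI key not set")


def ollama_stub(prompt):
    # Simulates a working local model
    return f"local answer to: {prompt}"


print(complete_with_fallback([openai_stub, ollama_stub], "hello"))
# local answer to: hello
```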
Custom Configuration
from pytidycensus.llm_interface import CensusAssistant, LLMManager, OpenAIProvider, OllamaProvider
# Create custom provider setup
providers = [
    OpenAIProvider(model="gpt-4", api_key="your_key"),
    OllamaProvider(model="mixtral:8x7b")  # Larger local model
]
llm_manager = LLMManager(providers)
assistant = CensusAssistant(llm_manager=llm_manager)
Real Example Conversations
These examples are taken directly from our test suite and demonstrate actual conversation flows:
Example 1: Wisconsin County Income Analysis
👤 User: I'm studying household income variations in Wisconsin
🏛️ Assistant: Great! I can help with income data. What geographic level do you need?
👤 User: I want county-level data
🏛️ Assistant: Perfect. County-level analysis provides good detail. What state are you interested in?
👤 User: Use the most recent data available
🏛️ Assistant: 2020 ACS data is a good choice. Should I use the 5-year estimates for better coverage?
👤 User: Generate the pytidycensus code
🏛️ Assistant: I'll generate the pytidycensus code for you now.
📋 Final State: geography=county, variables=['B19013_001'], state=WI, year=2020, dataset=acs5
Generated Code:
import pytidycensus as tc
# Set your Census API key
# Get one at: https://api.census.gov/data/key_signup.html
census_api_key = "YOUR_API_KEY_HERE"
# Get Census data (wide format with cleaned variable names)
data = tc.get_acs(
    geography="county",
    variables=["B19013_001E"],
    state="WI",
    year=2020,
    output="wide",
    api_key=census_api_key
)
# Clean variable names by removing 'E' suffix
column_mapping = {col: col[:-1] for col in data.columns
                  if col.endswith('E') and '_' in col and col.split('_')[0].startswith('B')}
if column_mapping:
    data = data.rename(columns=column_mapping)
    print(f"Cleaned {len(column_mapping)} variable names by removing 'E' suffix")
print(f"Retrieved {data.shape[0]} rows and {data.shape[1]} columns")
print(data.head())
Result: DataFrame with Wisconsin county data where B19013_001E becomes B19013_001
Example 2: DC Inequality Analysis with Normalization
👤 User: I want to study inequality in Washington DC
🏛️ Assistant: I understand. What additional details do you need for this analysis?
👤 User: Get poverty and income data by Census tract
🏛️ Assistant: Census tracts are ideal for neighborhood-level analysis. Which state and county?
👤 User: Include both counts and totals for calculating rates
🏛️ Assistant: I understand. What additional details do you need for this analysis?
👤 User: Use 2020 ACS 5-year data
🏛️ Assistant: 2020 ACS data is a good choice. Should I use the 5-year estimates for better coverage?
👤 User: Yes, generate the code
🏛️ Assistant: I'll generate the pytidycensus code for you now.
📋 Final State: geography=tract, variables=['B17001_002E', 'B17001_001E', 'B19013_001E'], state=DC, year=2020
Generated Code:
import pytidycensus as tc
# Set your Census API key
census_api_key = "YOUR_API_KEY_HERE"
# Get Census data (wide format with cleaned variable names)
data = tc.get_acs(
    geography="tract",
    variables=["B17001_002E", "B17001_001E", "B19013_001E"],
    state="DC",
    year=2020,
    output="wide",
    api_key=census_api_key
)
# Clean variable names by removing 'E' suffix
column_mapping = {col: col[:-1] for col in data.columns
                  if col.endswith('E') and '_' in col and col.split('_')[0].startswith('B')}
if column_mapping:
    data = data.rename(columns=column_mapping)
print(f"Retrieved {data.shape[0]} rows and {data.shape[1]} columns")
print(data.head())
# Calculate poverty rate
data['poverty_rate'] = (data['B17001_002'] / data['B17001_001']) * 100
Result:
- `B17001_002E` (poverty count) → `B17001_002`
- `B17001_001E` (total for poverty) → `B17001_001`
- `B19013_001E` (median income) → `B19013_001`
Notice how the assistant automatically included the normalization variable (B17001_001E) for calculating poverty rates, but did NOT suggest a denominator for median income since it’s already a rate.
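The rate calculation itself is plain pandas arithmetic. On a small synthetic DataFrame (the GEOIDs and counts below are made up for illustration), the pattern looks like:

```python
import pandas as pd

# Synthetic tract-level data with cleaned column names (values are made up)
data = pd.DataFrame({
    "GEOID": ["11001000100", "11001000200"],
    "B17001_002": [120, 340],    # people below poverty level
    "B17001_001": [1500, 2000],  # population for whom poverty is determined
})

# Rate = count / total, scaled to a percentage
data["poverty_rate"] = (data["B17001_002"] / data["B17001_001"]) * 100
print(data[["GEOID", "poverty_rate"]])
```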
Example 3: Spatial Analysis with Geometry
👤 User: I need Wisconsin county income data for mapping
🏛️ Assistant: I understand. What additional details do you need for this analysis?
👤 User: Include geographic boundaries
🏛️ Assistant: I understand. What additional details do you need for this analysis?
👤 User: 2020 ACS data
🏛️ Assistant: 2020 ACS data is a good choice. Should I use the 5-year estimates for better coverage?
👤 User: Generate the mapping-ready code
🏛️ Assistant: I'll generate the pytidycensus code for you now.
📋 Final State: geography=county, variables=['B19013_001'], state=WI, year=2020, geometry=True
Generated Code:
import pytidycensus as tc
# Get Census data (wide format with cleaned variable names)
data = tc.get_acs(
    geography="county",
    variables=["B19013_001E"],
    state="WI",
    year=2020,
    output="wide",
    geometry=True,
    api_key=census_api_key
)
print(f"Retrieved {data.shape[0]} rows and {data.shape[1]} columns")
print(data.head())
# Ready for mapping with GeoPandas
data.plot(column='B19013_001', legend=True)
Result: GeoPandas GeoDataFrame ready for mapping with clean column name B19013_001
Example 4: Direct Population Query
👤 User: Get me total population by state for 2020 decennial census
🏛️ Assistant: ✅ Success! I retrieved your Census data:
**Results**: 51 rows × 4 columns
**Data preview:**
  GEOID        NAME    P1_001
0    01     Alabama   5024279
1    02      Alaska    733391
2    04     Arizona   7151502
3    05    Arkansas   3011524
4    06  California  39538223
**Generated code:**
import pytidycensus as tc
# Get Census data (wide format with cleaned variable names)
data = tc.get_decennial(
    geography="state",
    variables=["P1_001N"],
    year=2020,
    output="wide",
    api_key=census_api_key
)
# Clean variable names by removing 'E' suffix
# (Note: Decennial variables use 'N' suffix, so no cleaning needed)
print(f"Retrieved {data.shape[0]} rows and {data.shape[1]} columns")
print(data.head())
Selective Normalization in Action
The assistant now intelligently determines when normalization variables are needed:
✅ Variables that GET normalization suggestions:
- `B08301_021E` (Workers who walked) → suggests `B08301_001E` (Total workers)
- `B25003_002E` (Owner-occupied units) → suggests `B25003_001E` (Total occupied units)
- `B19001_017E` (Households with income $200k+) → suggests `B19001_001E` (Total households)
❌ Variables that do NOT get normalization suggestions:
- `B19013_001E` (Median household income) - Already a median
- `B25064_001E` (Median gross rent) - Already a median
- `B08006_008E` (Mean travel time) - Already a mean
- `B25119_001E` (Housing cost as percentage) - Already a rate
- Any variable ending in `_001E` - These ARE the totals
This prevents the confusion of suggesting denominators for variables that don’t need them.
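One way to implement this selectivity is a small heuristic over the variable code and its label. The sketch below (the function name and rules are illustrative, not the assistant's actual logic) reproduces the behavior described above:

```python
def needs_denominator(code, label):
    """Heuristic: does this ACS variable need a normalization total?

    Medians, means, and percentages are already normalized, and
    *_001 variables are themselves the totals.
    """
    label_lower = label.lower()
    if any(word in label_lower for word in ("median", "mean", "percent")):
        return False
    if code.rstrip("E").endswith("_001"):
        return False  # this IS the total
    return True


print(needs_denominator("B08301_021E", "Workers who walked"))       # True
print(needs_denominator("B19013_001E", "Median household income"))  # False
print(needs_denominator("B19001_001E", "Total households"))         # False
```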
Testing and Development
Run Test Conversations
See actual conversation flows and generated code:
# Run verbose conversation tests
python tests/test_conversation_to_query.py verbose
# Run specific test
python tests/test_conversation_to_query.py verbose wisconsin
python tests/test_conversation_to_query.py verbose dc
Troubleshooting
Common Issues
No LLM providers available:
pip install openai  # For OpenAI
# OR
pip install ollama && ollama serve  # For local models
Census API key issues:
export CENSUS_API_KEY="your_key_here"
# Get free key: https://api.census.gov/data/key_signup.html
Variable not found errors:
- The assistant uses knowledge base + search to find variables 
- Some very specific variables may need manual specification 
- Use the `search_variables()` function for exploration
Contributing
The LLM interface is designed to be extensible:
- Add new variable mappings in `knowledge_base.py`
- Extend geography guidance with local expertise 
- Contribute test cases for new conversation patterns 
- Help improve normalization logic for edge cases 
The goal is making Census data accessible through natural conversation!