Faceted map example chart #1711

Open · 6 tasks done
palewire opened this issue Sep 27, 2019 · 40 comments
Labels: documentation · good first issue · vega: vega-datasets (Requires upstream action in `vega-datasets`)

Comments

@palewire (Contributor) commented Sep 27, 2019

Tasks

Based on (#1711 (comment)), (#1711 (comment))

Tip

The tasks below should be completed in order of appearance.
Earlier tasks can be performed by any contributor (new or old) familiar with alt.Chart.mark_geoshape.

  • Find an appropriate dataset
    • Contains a categorical field, not directly dependent on location
    • License that enables adding to vega/vega-datasets and using/transforming within vega projects (non-commercial)
      • Current dataset licenses: 1, 2
  • Demonstrate using the dataset (in this issue)
  • Open an upstream issue, linking back to this one
  • Follow up by opening an upstream PR

Example (not Altair)

(#1711 (comment)) @mattijn

Maybe something like species richness for a limited number of species.
Similar to below, from here, but preferably aggregated per county instead of raster cells:

Image
@mattijn (Contributor) commented Sep 30, 2019

Try this:

import altair as alt
from vega_datasets import data

states = alt.topo_feature(data.us_10m.url, 'states')
source = data.income.url

alt.Chart(source).mark_geoshape().encode(
    shape=alt.Shape(field='geo', type='geojson'),
    color='pct:Q',
    column='group:N',
    tooltip=['name:N', 'group:N', 'pct:Q']
).transform_lookup(
    lookup='id',
    from_=alt.LookupData(data=states, key='id'),
    as_='geo'
).properties(
    width=75,
    height=150
).project(
    type='albersUsa'
)

image

I noticed there is no shorthand for type='geojson' (otherwise you could do something like shape='geo:G'). It's also not mentioned in the Altair docs, whereas it is in the Vega-Lite docs.

@irisslee commented Oct 4, 2019

Here's an example using the LA riots sample dataset

import altair as alt
from vega_datasets import data

df = data.la_riots()

counties = alt.topo_feature('https://gist.githubusercontent.com/irisslee/70039051188dac8f64e14182b5a459a9/raw/2412c45551cff577f7b10604ca523bd3f4dd31d3/countytopo.json', 'county')

LAbasemap = alt.Chart(counties).mark_geoshape(
    fill='lightgray',
    stroke='white'
).properties(width=400, height=400).project('mercator')

points = alt.Chart().mark_circle().encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    size=alt.value(15),
    color='gender:N'
)

alt.layer(LAbasemap, points, data=df).facet('gender:N')

visualization

@jakevdp (Collaborator) commented Oct 4, 2019

That's a nice example of the mechanics of a faceted map, but I think for this particular dataset the visualization would be more effective without splitting it across facets.

@palewire (Contributor, Author) commented Oct 4, 2019 via email

@jakevdp (Collaborator) commented Oct 4, 2019

I haven't been able to come up with a good example.

@mattijn (Contributor) commented Oct 4, 2019

I added one already in #1714.

@palewire (Contributor, Author) commented Oct 4, 2019 via email

@mattijn (Contributor) commented Oct 4, 2019

@palewire (Contributor, Author) commented Oct 4, 2019

I think facets by time series segment or by a quantitative bracket are interesting, but I'd wager that both are much less common than charts that facet by a nominal category.

@mattijn (Contributor) commented Oct 4, 2019

What does a facet by quantitative data look like? Although years can be a quantitative data type as well, aren't they used as nominal categories here?

import altair as alt
from vega_datasets import data

countries = alt.topo_feature(data.world_110m.url, 'countries')
source = 'https://raw.githubusercontent.com/mattijn/datasets/master/cities_prediction_population.csv'

base = alt.Chart(countries).mark_geoshape(
    fill='lightgray',
    stroke='white',
    strokeWidth=0.2
).properties(width=300, height=200).project('naturalEarth1')

cities = alt.Chart().mark_circle().encode(
    latitude='lat:Q',
    longitude='lon:Q',
    size=alt.Size('population:Q', scale=alt.Scale(range=[0, 1000]), legend=alt.Legend(title="Population (million)")),
    fill=alt.value('green'),
    stroke=alt.value('white'),
    tooltip=['city:N', 'population:Q']
)

alt.layer(base, cities, data=source).facet(
    facet='year:N',
    columns=2,
    title='The 20 Most Populous Cities in the World by 2100'
)

image

Based on https://www.visualcapitalist.com/animated-map-worlds-populous-cities-2100/
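For reference, a facet by quantitative bracket can be expressed by binning the facet field. A minimal sketch (not a map; the cars dataset is used purely for illustration):

import altair as alt
from vega_datasets import data

cars = data.cars()

# Facet a scatter plot by a binned (bracketed) quantitative field
# instead of a nominal one.
alt.Chart(cars).mark_circle().encode(
    x='Weight_in_lbs:Q',
    y='Miles_per_Gallon:Q',
    facet=alt.Facet('Horsepower:Q', bin=True, columns=3)
)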

@palewire (Contributor, Author) commented Oct 4, 2019 via email

@mattijn (Contributor) commented Oct 4, 2019

Yeah, my example is more ordinal than nominal.

@palewire (Contributor, Author) commented Oct 4, 2019

In my opinion, the best Altair examples import from vega_datasets and do not require any transformation of data prior to plotting.

With those requirements, I'm not sure there's a suitable dataset in the current example list other than the LA riots dataset used by @irisslee. However, that set may require the import of outside geographies for the base map, something I think we should also aim to avoid.

Unless we can find a good candidate with the examples, or solve the issue of the base map for the riots data, I think we should consider nominating a new example dataset for vega_datasets to document this relatively common news chart.

@dangotbanned (Member) commented:

@mattijn was this not solved by?

If we were going by the issue title/description alone, your example (US Income by State: Wrapped Facet) seems like the solution.

Reading through the thread, it isn't clear to me what the additional requirements would be for closing the issue.

Note

I'm trying to do some housekeeping on old issues, e.g. closing, labelling, adding relationships.

@mattijn (Contributor) commented Jan 1, 2025

Happy new year @dangotbanned!
This is still a valid issue in the sense that there is currently not a good dataset that describes what @palewire is after.

Maybe something like species richness for a limited number of species.
Similar to:

image

Taken from here, but preferably aggregated per county instead of raster cells.

@dangotbanned (Member) commented:

#1711 (comment)

@mattijn in that case, how about we source a suitable dataset and open an issue in https://github.com/vega/vega-datasets?

We could then have access to it after dealing with:

@mattijn (Contributor) commented Jan 1, 2025

Sounds good to me!

@dangotbanned (Member) commented:

Sounds good to me!

Great @mattijn!

I'm not 100% sure the license for the dataset you linked would work for us:
https://creativecommons.org/licenses/by-nc-nd/4.0/

A potential starting point might be here, though: https://github.com/datasets

@giswqs commented Jan 4, 2025

Two data sources for consideration:

  • Zillow Research Data: https://www.zillow.com/research/data/
  • Data Commons: https://datacommons.org/

A sample US housing dataset derived from the Zillow data:
https://github.com/opengeos/data/blob/main/housing/Core/RDC_Inventory_Core_Metrics_State.csv

@giswqs commented Jan 4, 2025

A demo I created a few years ago for using the Zillow housing data. I am not quite sure about the dataset license though.

Demo: https://www.youtube.com/watch?v=gMghAsNuTbw
Web App: https://huggingface.co/spaces/giswqs/Streamlit/blob/main/pages/2_%F0%9F%8F%A0_U.S._Housing.py

If the Zillow dataset license does not work for you, https://datacommons.org/ has a lot of other open datasets.

@dangotbanned (Member) commented:

#1711 (comment), #1711 (comment)

Thanks for showing interest in this issue @giswqs!

I've just updated the description with an overview of what work I'm thinking we need to do. Hope you find that helpful.

I don't personally work with spatial data, but I'm quite impressed by your GitHub profile! 🧠

dangotbanned added the `vega: vega-datasets` (Requires upstream action in `vega-datasets`) label on Jan 7, 2025
@dsmedia (Contributor) commented Feb 4, 2025

The species data looks like a good candidate for a new dataset. We can probably leverage the county data already present in us-10m.json.

The source of the chart above, per the paper, is:

We obtained gridded 30-m × 30-m resolution maps of suitable habitat for 1697 species and subspecies (hereafter “species”) of mammals, birds, amphibians, and reptiles for CONUS through the US Geological Survey’s Gap Analysis Program (U.S. Geological Survey Gap Analysis Program, 2018). We focus on these species because methods of producing suitable habitat were consistent across species for the entire area.

It's probably not necessary to use all 1697 species. Instead, we can focus on four common species:

  • American Robin (Turdus migratorius)
  • American Bullfrog (Lithobates catesbeianus)
  • White-tailed Deer (Odocoileus virginianus)
  • Common Garter Snake (Thamnophis sirtalis)

Data are available for download here but will need some processing.

I'll work to put this together and will share a sample visualization to see if it's suitable.

@dsmedia (Contributor) commented Feb 5, 2025

@giswqs @mattijn

Source data:

  • USGS GAP habitat maps (30m resolution GeoTIFF)
  • Each cell indicates species presence in winter, summer, both, or neither
  • Files are large (~300MB each) for 4 different species

Target visualization:

  • Need to display this on a county-level US map
  • Using the us-10m.json county boundaries from vega-datasets

Questions:

  • What's the best approach to aggregate the high-resolution presence/absence data to the county level?
    • Should I use a binary indicator (species present/absent in county) or calculate the percentage of suitable habitat per county?
    • If binary, what should be the threshold to determine presence?
    • I'm trying to understand both the technical process of this aggregation in code (zonal statistics from GeoTIFF to GeoJSON) and which metric would be most meaningful for visualization.

Any practical tips or code snippets would be very welcome.

@mattijn (Contributor) commented Feb 6, 2025

The sizes are relatively large indeed. I would use an approach that incorporates geowombat for clipping the extent of each county (snapped to raster) before counting the 'year-round' pixels.

It will probably be a process that simmers for a few hours per species, but the following can function as a good starting point to get the job done. Rasterstats might also work, but given the size of the tif, I prefer more control.

import geopandas as gpd
import geowombat as gw
import pandas as pd

def compute_count_yearround(gdf_row, bounds, raster_fp):
    with gw.config.update(ref_bounds=bounds, ref_tar=raster_fp):
        with gw.open(raster_fp, dtype=int) as ds:
            ds = ds.gw.mask(gdf_row, keep='in')
            # I think 3 equals 'year-round'
            count_yearround = (ds == 3).sum().compute().item()
    return count_yearround

# bAMROx is 'Bird, American Robin'
raster_fp = "bAMROx_CONUS_HabMap_2001v1.tif"
vector_fp = "us10m.topo.json"

gdf = gpd.read_file(vector_fp, driver='TopoJSON')
gdf = gdf.set_crs(epsg=4326)  # degrees
gdf = gdf.to_crs(epsg=5070)  # meters; and same crs as `raster_fp`
gdf = gdf[~gdf.is_empty]  # filter out empty geoms
# slice first 5 rows to test method
gdf = gdf.iloc[0:5]

pixel_count_county = []
for ix, row in gdf.iterrows():
    # print(ix)
    bounds = row.geometry.bounds
    gdf_row = gdf.loc[ix:ix]
    id_count = compute_count_yearround(gdf_row, bounds, raster_fp)
    pixel_count_county.append((row.id, id_count, row.geometry.area))

df = pd.DataFrame(pixel_count_county, columns=["id", "bAMROx_count", "area"])
# for area percentage computation use pixel (spatial) resolution of 30 meters
df["bAMROx_pct"] = (df.bAMROx_count * (30*30) / df.area) * 100
df[['id', 'bAMROx_pct']]
      id  bAMROx_pct
0  53000   14.035112
1  53073   71.105570
2  30105   15.965316
3  30029   89.271899
4  16021   85.322979

@dangotbanned (Member) commented:

It will probably be a process that simmers for a few hours per species

@mattijn am I understanding this correctly, that you're saying processing this data will take 8-16 hours?

@mattijn (Contributor) commented Feb 6, 2025

As is, maybe? Depending on hardware. Can execute in parallel with chunks of counties. But without ingesting the raw geotiffs in dedicated systems, I think this is likely as expected. Potentially can resample to larger pixel sizes before counting to get a rough idea of the numbers.
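For what it's worth, a rough sketch of the chunked-parallel idea, assuming the compute_count_yearround() helper plus gdf and raster_fp from the snippet above (the worker count is a placeholder to tune):

# Hedged sketch: split the counties into chunks and process each chunk in a
# worker process. Requires compute_count_yearround(), gdf and raster_fp as
# defined in the geowombat snippet above.
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

def process_chunk(chunk, raster_fp):
    records = []
    for ix, row in chunk.iterrows():
        count = compute_count_yearround(chunk.loc[ix:ix], row.geometry.bounds, raster_fp)
        records.append((row.id, count, row.geometry.area))
    return records

if __name__ == "__main__":
    n_workers = 8  # placeholder; tune to the available cores
    chunk_size = -(-len(gdf) // n_workers)  # ceiling division
    chunks = [gdf.iloc[i:i + chunk_size] for i in range(0, len(gdf), chunk_size)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        chunked_records = pool.map(process_chunk, chunks, [raster_fp] * len(chunks))
    df = pd.DataFrame(
        [rec for records in chunked_records for rec in records],
        columns=["id", "bAMROx_count", "area"],
    )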

@dangotbanned (Member) commented:

#1711 (comment)

@mattijn I think the main factor would be this:

# NOTE: 3226 rows
gdf = gdf[~gdf.is_empty]  # filter out empty geoms
# slice first 5 rows to test method
# gdf = gdf.iloc[0:5] <------------------------------ (Assuming you'd want to use all rows)
for ix, row in gdf.iterrows():
    ...

Because inside this function you would be reading the same file 3226 times:

def compute_count_yearround(gdf_row, bounds, raster_fp):
    with gw.config.update(ref_bounds=bounds, ref_tar=raster_fp):
        with gw.open(raster_fp, dtype=int) as ds:
            ...

Also, pixel_count_county could be inlined to a listcomp:

pixel_count_county = [
    (
        row.id,
        compute_count_yearround(gdf.loc[ix:ix], row.geometry.bounds, raster_fp),
        row.geometry.area,
    )
    for ix, row in gdf.iterrows()
]

Ideally though, avoiding .iterrows() would probably yield a bigger perf boost.

Note

I'm not sure what file "bAMROx_CONUS_HabMap_2001v1.tif" is referring to and haven't used geowombat before

@dangotbanned (Member) commented:

Side note

@mattijn changing these arguments to gpd.read_file seems to give the same result:

# gpd.read_file(vector_fp, driver='TopoJSON')
gpd.read_file(vector_fp, layer="counties")

But it removes these warnings we also get when building the docs:

../altair/.venv/Lib/site-packages/pyogrio/raw.py:198: RuntimeWarning: driver TopoJSON does not support open option DRIVER
  return ogr_read(
../altair/.venv/Lib/site-packages/pyogrio/geopandas.py:265: UserWarning: More than one layer found in 'ff7a7e679c46f2d1eb85cc92521b990f1a7a5c7a.json': 'counties' (default), 'states', 'TopoJSON'. Specify layer parameter to avoid this warning.
  result = read_func(

@mattijn (Contributor) commented Feb 6, 2025

Surely one should only batch the for loop; why would you want to read the vector file so many times? By using the config manager of geowombat one avoids reading the full tiff, which is exactly the beauty of this code. I don't think the listcomp is the bottleneck here, and I really don't think the numbers are bad as is (given the source tif is ~15 billion pixels), but I'm always open to learning faster methods to compute this!

@dangotbanned (Member) commented:

why would you want to read the vector file so many times?

I'm confused; this part (which I'm assuming you mean by vector file) is only read once in either case:

vector_fp = "us10m.topo.json"

gdf = gpd.read_file(vector_fp, driver='TopoJSON')

I'm suggesting that what you originally proposed would read the same file 3226 times

with gw.open(raster_fp, dtype=int) as ds:

Maybe this is normal in the spatial domain?
Generally though, I'd expect reading the file once or somehow reading it with some predicates applied would be ideal

I really don't think the numbers are bad as is (given source tif is ~15billion pixels), but always open to learn faster methods to compute this!

@mattijn Could you link/point to how I'd download "bAMROx_CONUS_HabMap_2001v1.tif" please?
Would be fun to experiment a little bit.
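For reference, a minimal sketch of the "read it with some predicates applied" idea using rasterio windowed reads, so the GeoTIFF is opened once and only the block covering each county is read (assuming the same counties TopoJSON and GeoTIFF as above; unlike exactextract, partial pixel coverage at county boundaries is ignored):

import geopandas as gpd
import rasterio
from rasterio.features import geometry_mask
from rasterio.windows import from_bounds

vector_fp = "us10m.topo.json"
raster_fp = "bAMROx_CONUS_HabMap_2001v1.tif"

gdf = gpd.read_file(vector_fp, layer="counties")
gdf = gdf.set_crs(epsg=4326).to_crs(epsg=5070)  # match the raster CRS
gdf = gdf[~gdf.is_empty]

counts = []
with rasterio.open(raster_fp) as src:  # the GeoTIFF is opened exactly once
    for row in gdf.itertuples():
        # Read only the pixel block covering this county's bounding box.
        # Counties outside the raster extent (e.g. AK/HI) would need extra
        # handling and are ignored in this sketch.
        window = from_bounds(*row.geometry.bounds, transform=src.transform)
        window = window.round_offsets().round_lengths()
        data = src.read(1, window=window)
        # Mask of pixels whose centers fall inside the county polygon
        inside = geometry_mask(
            [row.geometry],
            out_shape=data.shape,
            transform=src.window_transform(window),
            invert=True,
        )
        counts.append((row.id, int((data[inside] == 3).sum())))  # 3 == 'year-round'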

@mattijn (Contributor) commented Feb 6, 2025

OK, now I see you included the code section reading the GeoPandas dataframe only to state that it has 3226 rows.
I'm indeed iterating over the same tif file, but each time with different bounds. To download the data, go to https://gapanalysis.usgs.gov/apps/species-data-download/, search for American Robin (top-right) and then click "download habitat data". The tif is zipped, I think.

@dsmedia (Contributor) commented Feb 6, 2025

If necessary we can always choose a less common species, e.g. the barking treefrog (49 MB).

... or we could show a single state or region rather than the whole US.

... but it would be cool if the generation script could be configured to allow any species to be processed in a reasonable amount of time.

@dangotbanned (Member) commented Feb 6, 2025

#1711 (comment)

Thanks @mattijn!

I'd downloaded the right file before (AmericanRobin_T.zip), but hadn't checked what was in it 😅

So the file is at this path (if anyone else is new to this):

AmericanRobin_T.zip/bAMROx_CONUS_HabMap_2001v1.zip/bAMROx_CONUS_HabMap_2001v1.tif

Edit

I'm probably not going to get to this soon, so don't let me be the holdup on anyone else making progress.

@mattijn (Contributor) commented Feb 8, 2025

A faster method to compute this is using exactextract, which recently gained Python bindings. Using this approach it took me less than 8 minutes to compute the fractions for all counties for a single species.

import geopandas as gpd
from exactextract import exact_extract

vector_fp = "us10m.topo.json"
raster_fp = "bAMROx_CONUS_HabMap_2001v1.tif"

gdf = gpd.read_file(vector_fp, layer='counties')
gdf = gdf.set_crs(epsg=4326)  # degrees
gdf = gdf.to_crs(epsg=5070)  # meters; and same crs as `raster_fp`
gdf = gdf[~gdf.is_empty]  # filter out empty geoms
gdf.columns = ['county_id', 'geometry']
gdf = gdf.iloc[0:5]

# default_value: specifies a value to be used for NODATA pixels instead of ignoring them
# use 255, since source raster is 8bit integer
df = exact_extract(
    rast=raster_fp, 
    vec=gdf, 
    ops=["unique(default_value=255)", "frac(default_value=255)"], 
    include_cols='county_id', 
    output='pandas',
    progress=True
)
df = df.explode(['unique', 'frac'])
df = df[df.unique == 3].drop('unique', axis=1)
  county_id      frac
0     53000  0.138258
1     53073  0.710393
2     30105  0.159549
3     30029  0.892068
4     16021  0.852625

The fractions are slightly off compared to the previous snap-to-raster geowombat approach, but close enough. Using this approach we are much closer to computing all of this in a reasonable amount of time.

@dsmedia (Contributor) commented Feb 9, 2025

Great - with this, I can make an example visualization to share shortly.

A faster method to compute this is using exactextract, which recently gained Python bindings. Using this approach it took me less than 8 minutes to compute the fractions for all counties for a single species.

@mattijn (Contributor) commented Feb 9, 2025

Awesome, looking forward to it! I think this can also be a use case to highlight the show mode for showing invalid values (see: https://vega.github.io/vega-lite/docs/invalid-data.html#show, available since Vega-Altair version 5.4.0):

import altair as alt
import pandas as pd
from vega_datasets import data

df = pd.read_csv('bAMROx_counties.csv')
counties = alt.topo_feature(data.us_10m.url, 'counties')

alt.Chart(counties).mark_geoshape().encode(
    color=alt.Color('frac:Q').scale(scheme='lightgreyteal')
).transform_lookup(
    lookup='id',
    from_=alt.LookupData(data=df, key='county_id', fields=['frac'])
).project(
    type='albersUsa'
).configure_mark(invalid='show')

Image

@mattijn (Contributor) commented Feb 9, 2025

Btw, I realised that exact_extract() also accepts lists of raster inputs (see the CLI docs on multiple input files). It can then reuse the computation for each county/raster overlay. I was able to compute everything in just over 20 minutes. Quite impressive, I have to say!

Meaning the data can be derived using:

import geopandas as gpd
import pandas as pd
import numpy as np
from exactextract import exact_extract

vector_fp = "us10m.topo.json"
gdf = gpd.read_file(vector_fp, layer='counties')
gdf = gdf.set_crs(epsg=4326)  # degrees
gdf = gdf.to_crs(epsg=5070)  # meters; same crs as the rasters
gdf = gdf[~gdf.is_empty]  # filter out empty geoms
gdf.columns = ['county_id', 'geometry']

raster_fps = [
    "bAMROx_CONUS_HabMap_2001v1.tif",
    "aAMBUx_CONUS_HabMap_2001v1.tif",
    "mWTDEx_CONUS_HabMap_2001v1.tif",
    "rCOGAx_CONUS_HabMap_2001v1.tif"
]

df = exact_extract(
    rast=raster_fps, 
    vec=gdf, 
    ops=["unique(default_value=255)", "frac(default_value=255)"], 
    include_cols='county_id', 
    output='pandas',
    progress=True
)

# List to store results
all_data = []

for raster in raster_fps:
    stem = raster.split('.')[0]
    species = stem.split('_')[0]
    
    unique_col = f"{stem}_unique"
    frac_col = f"{stem}_frac"

    # Convert columns to NumPy arrays
    unique_values = np.array(df[unique_col].to_list(), dtype=object)
    frac_values = np.array(df[frac_col].to_list(), dtype=object)

    # Repeat county_id based on array lengths
    county_ids = np.repeat(df["county_id"].values, [len(arr) for arr in unique_values])

    # Flatten arrays
    unique_flat = np.concatenate(unique_values)
    frac_flat = np.concatenate(frac_values)

    # Filter where unique == 3
    mask = unique_flat == 3
    temp_df = pd.DataFrame({
        "county_id": county_ids[mask],
        "species": species,
        "pct": frac_flat[mask]
    })

    all_data.append(temp_df)

# Combine results into a single DataFrame
df_melted = pd.concat(all_data, ignore_index=True)
df_melted.to_csv('species_counties.csv', index=False)

If you do not have the time to do this by yourself, here is the data that comes out of this: species_counties.csv

And for the viz I had a quick attempt as such:

import altair as alt
import pandas as pd
from vega_datasets import data

alt.data_transformers.disable_max_rows()

# Load data
df = pd.read_csv('species_counties.csv')
counties = alt.topo_feature(data.us_10m.url, 'counties')

# Create maps for each species
charts = [
    alt.Chart(counties, title=species)
    .mark_geoshape(tooltip=True)
    .encode(
        color=alt.Color('pct:Q', scale=alt.Scale(scheme='inferno'))
    )
    .transform_lookup(
        lookup='id',
        from_=alt.LookupData(data=df[df['species'] == species], key='county_id', fields=['pct'])
    )
    .project(type='albersUsa')
    .properties(width=180, height=130)
    for species in df['species'].unique()
]

# Concatenate charts into a grid layout
chart = alt.concat(*charts, columns=2).configure_mark(invalid='show')

chart
Image

I did realise we do something similar in the docs here: https://altair-viz.github.io/user_guide/marks/geoshape.html#facet-a-map.

Vega-Editor link: https://vega.github.io/editor/#/gist/5e2141d8e73ea6acd96812bda364ed1d/vl5_species.json

@dangotbanned (Member) commented Feb 9, 2025

#1711 (comment)

@mattijn looking quite good

And for the viz I had a quick attempt as such:
...

Would you be able to take a look at this PR?

There appears to be an abandoned PR that would fix geoshape facet.

Would be great if we could have this example working with .facet(...) rather than a listcomp, if only for simplicity.
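For reference, a sketch of what the .facet()-based version might look like once such a fix lands, reusing the species_counties.csv fields from above (at the time of writing this may not render correctly, which is exactly what that PR addresses):

import altair as alt
import pandas as pd
from vega_datasets import data

df = pd.read_csv('species_counties.csv')
counties = alt.topo_feature(data.us_10m.url, 'counties')

alt.Chart(counties).mark_geoshape(tooltip=True).encode(
    color=alt.Color('pct:Q', scale=alt.Scale(scheme='inferno')),
    facet=alt.Facet('species:N', columns=2)
).transform_lookup(
    lookup='id',
    from_=alt.LookupData(data=df, key='county_id', fields=['pct', 'species'])
).project(
    type='albersUsa'
).properties(width=180, height=130)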

@mattijn (Contributor) commented Feb 9, 2025

Agree it would be better, but unfortunately I'm not confident with TypeScript.

@dsmedia (Contributor) commented Feb 18, 2025

@mattijn Would this arrow file suffice? Do we want to change up the species before setting this for posterity? Below is a sample visualization (and script) to see if the species selection + dataset format is ok. (We can enhance the tooltip in the actual example.)

Also, would there be a straightforward way to spot-check a single county's calculation of the pct field? If so, I'd like to include this in vega-datasets#684 to highlight the accuracy of the technique used.

Image

Species Habitat Visualization Code
import altair as alt
import pyarrow as pa
import pyarrow.compute  # ensure pa.compute is available for the filter below
import requests
import io
from vega_datasets import data

# Download and load the Arrow file
url = "https://github.com/vega/vega-datasets/raw/c4ff52d22bfa2e70a1b40554cc4c56df216838f3/data/species.arrow"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
table = pa.ipc.RecordBatchFileReader(io.BytesIO(response.content)).read_all()

# Disable row limit for Altair (if using vegafusion, this may not be needed)
alt.data_transformers.disable_max_rows()

# Load US counties topology
counties = alt.topo_feature(data.us_10m.url, 'counties')

# Create a chart for each unique species (limiting to first 4 for demonstration)
#  With pyarrow, we can use filter directly on the table.
species_list = table.column('CommonName').to_pylist()
species_list = list(dict.fromkeys(species_list))[:4]  # Get unique species and limit

charts = [
    alt.Chart(counties, title=species)
    .mark_geoshape(tooltip=True)
    .encode(
        color=alt.Color(
            'percent_habitat:Q',
            scale=alt.Scale(
                domain=[0, 1],
                scheme='viridis',
                zero=True,
                nice=False
            ),
            title='Habitat Percentage',
            legend=alt.Legend(format=".0%")  # Format within the Legend object
        ),
        tooltip=[
            alt.Tooltip('county_id:N', title='County ID'),
            alt.Tooltip('percent_habitat:Q', title='Habitat %', format='.2%')
        ]
    )
    .transform_lookup(
        lookup='id',
        from_=alt.LookupData(
            # Filter the pyarrow table directly, no need for to_pandas()
            data=table.filter(pa.compute.equal(table.column('CommonName'), species)),
            key='county_id',
            fields=['percent_habitat']
        )
    )
    .transform_calculate(
        percent_habitat="datum.percent_habitat === null ? 0 : datum.percent_habitat"
    )
    .project(type='albersUsa')
    .properties(width=300, height=200)
    for species in species_list
]

# Combine charts into a grid
chart = alt.concat(*charts, columns=2).configure_view(
    stroke=None
).configure_mark(
    invalid='filter'
)

# Display the chart
chart
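On the spot-check question above: one rough way to verify a single county's pct value is to clip the GeoTIFF to that county with rasterio and count the year-round pixels directly. A sketch, assuming the same counties TopoJSON and American Robin GeoTIFF as earlier in the thread; the FIPS id is only an example, and plain pixel counting will differ slightly from exactextract's coverage-weighted fractions:

import geopandas as gpd
import rasterio
from rasterio.mask import mask

vector_fp = "us10m.topo.json"
raster_fp = "bAMROx_CONUS_HabMap_2001v1.tif"
check_id = "53073"  # example county FIPS id to spot-check

gdf = gpd.read_file(vector_fp, layer="counties")
gdf = gdf.set_crs(epsg=4326).to_crs(epsg=5070)  # match the raster CRS
county = gdf[gdf["id"].astype(str) == check_id]

with rasterio.open(raster_fp) as src:
    # Clip to the county polygon; filled=False keeps a mask marking
    # cells that fall outside the polygon.
    clipped, _ = mask(src, county.geometry, crop=True, filled=False)

inside = ~clipped.mask  # pixels inside the county polygon
pct = (clipped.data[inside] == 3).sum() / inside.sum()  # 3 == 'year-round'
print(f"county {check_id}: {pct:.2%} year-round habitat")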

dsmedia added a commit that referenced this issue Feb 22, 2025
This commit introduces a new example to the Altair documentation showcasing choropleth maps faceted by category.

The example visualizes the distribution of suitable habitat for different species across US counties, using the proposed new Species Habitat dataset from vega-datasets (vega/vega-datasets#684).

Key features of this example:

- Demonstrates the use of `alt.Chart.mark_geoshape()` for geographical visualizations.
- Shows how to create faceted maps for comparing categorical data across geographic regions.
- Utilizes the `transform_lookup` and `transform_calculate` transforms for data manipulation within Altair.
- Uses a CSV data file temporarily hosted in the vega-datasets repository branch (pending dataset merge).

This example addresses issue #1711, which requested a faceted map example for the Altair documentation.

Co-authored-by: Mattijn van Hoek <mattijn@gmail.com>