Faceted map example chart #1711
Try this:

```python
import altair as alt
from vega_datasets import data

states = alt.topo_feature(data.us_10m.url, 'states')
source = data.income.url

alt.Chart(source).mark_geoshape().encode(
    shape=alt.Shape(field='geo', type='geojson'),
    color='pct:Q',
    column='group:N',
    tooltip=['name:N', 'group:N', 'pct:Q']
).transform_lookup(
    lookup='id',
    from_=alt.LookupData(data=states, key='id'),
    as_='geo'
).properties(
    width=75,
    height=150
).project(
    type='albersUsa'
)
```

I noticed there is no shorthand for
Here's an example using the LA riots sample dataset:

```python
import altair as alt
from vega_datasets import data

df = data.la_riots()
n = alt.topo_feature(
    'https://gist.githubusercontent.com/irisslee/70039051188dac8f64e14182b5a459a9/raw/2412c45551cff577f7b10604ca523bd3f4dd31d3/countytopo.json',
    'county'
)

LAbasemap = alt.Chart(n).mark_geoshape(
    fill='lightgray',
    stroke='white'
).properties(width=400, height=400).project('mercator')

points = alt.Chart().mark_circle().encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    size=alt.value(15),
    color='gender:N'
)

alt.layer(LAbasemap, points, data=df).facet('gender:N')
```
That's a nice example of the mechanics of a faceted map, but I think for this particular dataset the visualization would be more effective without splitting it across facets.
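A minimal sketch of what that single-map alternative could look like, assuming the same gist basemap and columns as the snippet above (it is Iris Lee's code unchanged except for dropping the `.facet()` call):

```python
import altair as alt
from vega_datasets import data

df = data.la_riots()
n = alt.topo_feature(
    'https://gist.githubusercontent.com/irisslee/70039051188dac8f64e14182b5a459a9/raw/2412c45551cff577f7b10604ca523bd3f4dd31d3/countytopo.json',
    'county'
)

LAbasemap = alt.Chart(n).mark_geoshape(
    fill='lightgray',
    stroke='white'
).properties(width=400, height=400).project('mercator')

points = alt.Chart().mark_circle().encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    size=alt.value(15),
    color='gender:N'   # gender distinguished by colour within one map
)

# Same layers as before, just without .facet('gender:N')
alt.layer(LAbasemap, points, data=df)
```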
What do you see as an ideal example of a faceted map for the gallery?
I haven't been able to come up with a good example.
I added one already in #1714.
In news graphics, the most common case for a faceted map is when you want to create a set of "mini multiples" that compare quantitative values on a shared scale across a set of competing nominative values. A current example would be mapping the location of campaign donors across America for the 20+ Democratic presidential candidates.

If you want something in that ballpark, I think we should look for a sample 50-state dataset that has a nominative facet where the different categories show some variety across the country.
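A rough sketch of that pattern; the donor-totals file and its columns here are purely hypothetical placeholders, and the structure mirrors the income-by-state example at the top of the thread:

```python
import altair as alt
from vega_datasets import data

states = alt.topo_feature(data.us_10m.url, 'states')
# Hypothetical tidy file: one row per (state id, candidate) with a donor total
donors = 'donors_by_state.csv'

alt.Chart(donors).mark_geoshape().encode(
    shape=alt.Shape(field='geo', type='geojson'),
    color='total:Q',                             # quantitative value on a shared scale
    facet=alt.Facet('candidate:N', columns=5)    # one mini map per candidate
).transform_lookup(
    lookup='id',
    from_=alt.LookupData(data=states, key='id'),
    as_='geo'
).properties(
    width=120,
    height=80
).project(
    type='albersUsa'
)
```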
I think facets by time series segment or by a quantitative bracket are interesting, but I'd wager that both are much less common than charts that facet by a nominative category.
What does a facet by quantitative data look like? Although years can be a quantitative data type as well, aren't they used as nominative categories here?

```python
import altair as alt
from vega_datasets import data

countries = alt.topo_feature(data.world_110m.url, 'countries')
source = 'https://raw.githubusercontent.com/mattijn/datasets/master/cities_prediction_population.csv'

base = alt.Chart(countries).mark_geoshape(
    fill='lightgray',
    stroke='white',
    strokeWidth=0.2
).properties(width=300, height=200).project('naturalEarth1')

cities = alt.Chart().mark_circle().encode(
    latitude='lat:Q',
    longitude='lon:Q',
    size=alt.Size('population:Q', scale=alt.Scale(range=[0, 1000]), legend=alt.Legend(title="Population (million)")),
    fill=alt.value('green'),
    stroke=alt.value('white'),
    tooltip=['city:N', 'population:Q']
)

alt.layer(base, cities, data=source).facet(
    facet='year:N',
    columns=2,
    title='The 20 Most Populous Cities in the World by 2100'
)
```

Based on https://www.visualcapitalist.com/animated-map-worlds-populous-cities-2100/
Perhaps I am not using the term nominative correctly, but in the example you give I would say you are still grouping an ordinal time series at the end of the day. The result is an example that is slightly more complex, and less common, than one where the dataset already possesses a simple categorical column, like political candidate in my earlier example, or like gender in the one given by Iris Lee.
Yeah, my example is more ordinal than nominal.
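For completeness, a sketch of what a genuinely quantitative facet could look like, binning a numeric column into brackets; this assumes the la_riots table has an age column and is not code from the discussion itself:

```python
import altair as alt
from vega_datasets import data

df = data.la_riots()

# Facet the point map by a binned quantitative field rather than a nominal one
alt.Chart(df).mark_circle(size=15).encode(
    longitude='longitude:Q',
    latitude='latitude:Q',
    color='gender:N',
    facet=alt.Facet('age:Q', bin=alt.Bin(maxbins=4), columns=2)
).project('mercator').properties(width=200, height=200)
```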
In my opinion, the best Altair examples import from

With those requirements, I'm not sure there's a suitable dataset in the current example list other than the LA riots dataset used by @irisslee. However, that set may require the import of outside geographies for the base map, something I think we should also aim to avoid. Unless we can find a good candidate with the examples, or solve the issue of the base map for the riots data, I think we should consider nominating a new example dataset for vega_datasets to document this relatively common news chart.
@mattijn was this not solved by?

If we were going by the issue title/description alone, your example (US Income by State: Wrapped Facet) seems like the solution. Reading through the thread, it isn't clear to me what the additional requirements would be for closing the issue.

Note: I'm trying to do some housekeeping on old issues, e.g. closing, labelling, adding relationships.
Happy new year @dangotbanned! Maybe something like species richness for a limited number of species. Taken from here, but then preferably aggregated per county, instead of raster cells.
@mattijn in that case, how about we source a suitable dataset and open an issue in https://github.com/vega/vega-datasets? We could then have access to it after dealing with:
Sounds good to me!
Great @mattijn! I'm not 100% sure the license for the dataset you linked would work for us. A potential starting point might be here though: https://github.com/datasets
Two data sources for consideration:
- Zillow Research Data: https://www.zillow.com/research/data/
- A sample US housing dataset derived from the Zillow data.
A demo I created a few years ago for using the Zillow housing data. I am not quite sure about the dataset license though. Demo: https://www.youtube.com/watch?v=gMghAsNuTbw

If the Zillow dataset license does not work for you, https://datacommons.org/ has a lot of other open datasets.
#1711 (comment), #1711 (comment)

Thanks for showing interest in this issue @giswqs! I've just updated the description with an overview of what work I'm thinking we need to do. Hope you find that helpful. I don't personally work with spatial data, but I'm quite impressed by your GitHub profile! 🧠
The species data looks like a good candidate for a new dataset. We can probably leverage the county data already present in us-10m.json. The source of the chart above, per the paper, is:
It's probably not necessary to use all 1697 species. Instead, we can focus on four common species:
Data are available for download here but will need some processing. I'll work to put this together and will share a sample visualization to see if it's suitable.
Source data:
Target visualization:
Questions:
Any practical tips or code snippets would be very welcome.
The sizes are relatively large indeed. I would use an approach that incorporates geowombat for clipping the extent of each county (snapped to raster) before counting the 'year-round' pixels. It will probably be a process that simmers for a few hours per species, but the following can function as a good starting point to get the job done. Rasterstats also might function, but given the size of the tif, I prefer more control.

```python
import geopandas as gpd
import geowombat as gw
import pandas as pd

def compute_count_yearround(gdf_row, bounds, raster_fp):
    with gw.config.update(ref_bounds=bounds, ref_tar=raster_fp):
        with gw.open(raster_fp, dtype=int) as ds:
            ds = ds.gw.mask(gdf_row, keep='in')
            # I think 3 equals 'year-round'
            count_yearround = (ds == 3).sum().compute().item()
    return count_yearround

# bAMROx is 'Bird, American Robin'
raster_fp = "bAMROx_CONUS_HabMap_2001v1.tif"
vector_fp = "us10m.topo.json"

gdf = gpd.read_file(vector_fp, driver='TopoJSON')
gdf = gdf.set_crs(epsg=4326)  # degrees
gdf = gdf.to_crs(epsg=5070)   # meters; and same crs as `raster_fp`
gdf = gdf[~gdf.is_empty]      # filter out empty geoms

# slice first 5 rows to test method
gdf = gdf.iloc[0:5]

pixel_count_county = []
for ix, row in gdf.iterrows():
    # print(ix)
    bounds = row.geometry.bounds
    gdf_row = gdf.loc[ix:ix]
    id_count = compute_count_yearround(gdf_row, bounds, raster_fp)
    pixel_count_county.append((row.id, id_count, row.geometry.area))

df = pd.DataFrame(pixel_count_county, columns=["id", "bAMROx_count", "area"])
# for area percentage computation use pixel (spatial) resolution of 30 meters
df["bAMROx_pct"] = (df.bAMROx_count * (30*30) / df.area) * 100
df[['id', 'bAMROx_pct']]
```
@mattijn am I understanding this correctly, that you're saying processing this data will take 8-16 hours?
As is, maybe? Depending on hardware. It can execute in parallel with chunks of counties. But without ingesting the raw geotiffs in dedicated systems, I think this is likely as expected. Potentially we can resample to larger pixel sizes before counting to get a rough idea of the numbers.
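For what it's worth, a rough sketch of that chunked-parallel idea, reusing `compute_count_yearround`, `gdf`, and `raster_fp` from the snippet above and assuming a fork-based start method so workers inherit those globals:

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
import pandas as pd

def process_chunk(index_chunk):
    # Run the per-county count for one chunk of GeoDataFrame index labels
    rows = []
    for ix in index_chunk:
        row = gdf.loc[ix]
        count = compute_count_yearround(gdf.loc[ix:ix], row.geometry.bounds, raster_fp)
        rows.append((row.id, count, row.geometry.area))
    return rows

if __name__ == "__main__":
    # Split counties into one chunk per worker and process the chunks in parallel
    chunks = np.array_split(np.asarray(gdf.index), 8)
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = [rec for chunk in pool.map(process_chunk, chunks) for rec in chunk]
    df = pd.DataFrame(results, columns=["id", "bAMROx_count", "area"])
```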
@mattijn I think the main factor would be this:

```python
# NOTE: 3226 rows
gdf = gdf[~gdf.is_empty]  # filter out empty geoms

# slice first 5 rows to test method
# gdf = gdf.iloc[0:5]  # <------ (Assuming you'd want to use all rows)

for ix, row in gdf.iterrows():
    ...
```

Because up in this function you would be reading the same file 3226 times:

```python
def compute_count_yearround(gdf_row, bounds, raster_fp):
    with gw.config.update(ref_bounds=bounds, ref_tar=raster_fp):
        with gw.open(raster_fp, dtype=int) as ds:
            ...
```

Also,

```python
pixel_count_county = [
    (
        row.id,
        compute_count_yearround(gdf.loc[ix:ix], row.geometry.bounds, raster_fp),
        row.geometry.area,
    )
    for ix, row in gdf.iterrows()
]
```

Ideally though, avoiding

Note: I'm not sure what file
Side note: @mattijn changing these arguments to

```python
# gpd.read_file(vector_fp, driver='TopoJSON')
gpd.read_file(vector_fp, layer="counties")
```

But it removes these warnings we also get when building the docs:

```
../altair/.venv/Lib/site-packages/pyogrio/raw.py:198: RuntimeWarning: driver TopoJSON does not support open option DRIVER
  return ogr_read(
../altair/.venv/Lib/site-packages/pyogrio/geopandas.py:265: UserWarning: More than one layer found in 'ff7a7e679c46f2d1eb85cc92521b990f1a7a5c7a.json': 'counties' (default), 'states', 'TopoJSON'. Specify layer parameter to avoid this warning.
  result = read_func(
```
Surely one should only batch the for loop; why would you want to read the vector file so many times? By using the config manager of geowombat one avoids reading the full tiff, and that is especially the beauty of this code. I don't think the listcomp is the bottleneck here. I really don't think the numbers are bad as is (given the source tif is ~15 billion pixels), but I'm always open to learning faster methods to compute this!
I'm confused, this part (which I'm assuming you mean by vector file) is only read once in either case:

```python
vector_fp = "us10m.topo.json"
gdf = gpd.read_file(vector_fp, driver='TopoJSON')
```

I'm suggesting that what you originally proposed would read the same file 3226 times:

```python
with gw.open(raster_fp, dtype=int) as ds:
```

Maybe this is normal in the spatial domain?

@mattijn Could you link/point to how I'd download
OK, now I see — you included the code section that reads the geopandas dataframe only to state that it has 3226 rows.
If necessary we can always choose a less common species, e.g. barking treefrog (49 MB)... or we could show a single state or region rather than the whole USA... but it would be cool if the generation script could be configured to allow any species to be processed in a reasonable amount of time.
Thanks @mattijn! I'd downloaded the right file before (

So the file is at this path (if anyone else is new to this):

Edit: I'm probably not going to get to this soon, so don't let me be the hold-up on anyone else making progress.
A faster method to compute this is using exactextract, which recently gained Python bindings. Using this approach it took me less than 8 minutes to compute the fractions in all counties for a single species.

```python
import geopandas as gpd
from exactextract import exact_extract

vector_fp = "us10m.topo.json"
raster_fp = "bAMROx_CONUS_HabMap_2001v1.tif"

gdf = gpd.read_file(vector_fp, layer='counties')
gdf = gdf.set_crs(epsg=4326)  # degrees
gdf = gdf.to_crs(epsg=5070)   # meters; and same crs as `raster_fp`
gdf = gdf[~gdf.is_empty]      # filter out empty geoms
gdf.columns = ['county_id', 'geometry']
gdf = gdf.iloc[0:5]

# default_value: specifies a value to be used for NODATA pixels instead of ignoring them
# use 255, since source raster is 8bit integer
df = exact_extract(
    rast=raster_fp,
    vec=gdf,
    ops=["unique(default_value=255)", "frac(default_value=255)"],
    include_cols='county_id',
    output='pandas',
    progress=True
)
df = df.explode(['unique', 'frac'])
df = df[df.unique == 3].drop('unique', axis=1)
```

The fractions are slightly off, but close enough, compared to the previous snap-to-raster geowombat approach. Using this approach we are much closer to computing all of this in a reasonable amount of time.
Great - with this, I can make an example visualization to share shortly.
Awesome, looking forward! I think this also can be a use-case to highlight the
Btw, I realised the

Meaning the data can be derived using:

```python
import geopandas as gpd
import pandas as pd
import numpy as np
from exactextract import exact_extract

vector_fp = "us10m.topo.json"
gdf = gpd.read_file(vector_fp, layer='counties')
gdf = gdf.set_crs(epsg=4326)  # degrees
gdf = gdf.to_crs(epsg=5070)   # meters; and same crs as `raster_fp`
gdf = gdf[~gdf.is_empty]      # filter out empty geoms
gdf.columns = ['county_id', 'geometry']

raster_fps = [
    "bAMROx_CONUS_HabMap_2001v1.tif",
    "aAMBUx_CONUS_HabMap_2001v1.tif",
    "mWTDEx_CONUS_HabMap_2001v1.tif",
    "rCOGAx_CONUS_HabMap_2001v1.tif"
]

df = exact_extract(
    rast=raster_fps,
    vec=gdf,
    ops=["unique(default_value=255)", "frac(default_value=255)"],
    include_cols='county_id',
    output='pandas',
    progress=True
)

# List to store results
all_data = []

for raster in raster_fps:
    stem = raster.split('.')[0]
    species = stem.split('_')[0]
    unique_col = f"{stem}_unique"
    frac_col = f"{stem}_frac"

    # Convert columns to NumPy arrays
    unique_values = np.array(df[unique_col].to_list(), dtype=object)
    frac_values = np.array(df[frac_col].to_list(), dtype=object)

    # Repeat county_id based on array lengths
    county_ids = np.repeat(df["county_id"].values, [len(arr) for arr in unique_values])

    # Flatten arrays
    unique_flat = np.concatenate(unique_values)
    frac_flat = np.concatenate(frac_values)

    # Filter where unique == 3
    mask = unique_flat == 3

    temp_df = pd.DataFrame({
        "county_id": county_ids[mask],
        "species": species,
        "pct": frac_flat[mask]
    })
    all_data.append(temp_df)

# Combine results into a single DataFrame
df_melted = pd.concat(all_data, ignore_index=True)
df_melted.to_csv('species_counties.csv', index=False)
```

If you do not have the time to do this by yourself, here is the data that comes out of this: species_counties.csv

And for the viz I had a quick attempt as such:

```python
import altair as alt
import pandas as pd
from vega_datasets import data

alt.data_transformers.disable_max_rows()

# Load data
df = pd.read_csv('species_counties.csv')
counties = alt.topo_feature(data.us_10m.url, 'counties')

# Create maps for each species
charts = [
    alt.Chart(counties, title=species)
    .mark_geoshape(tooltip=True)
    .encode(
        color=alt.Color('pct:Q', scale=alt.Scale(scheme='inferno'))
    )
    .transform_lookup(
        lookup='id',
        from_=alt.LookupData(data=df[df['species'] == species], key='county_id', fields=['pct'])
    )
    .project(type='albersUsa')
    .properties(width=180, height=130)
    for species in df['species'].unique()
]

# Concatenate charts into a grid layout
chart = alt.concat(*charts, columns=2).configure_mark(invalid='show')
chart
```

I did realise we do something similar in the docs here: https://altair-viz.github.io/user_guide/marks/geoshape.html#facet-a-map. Vega-Editor link: https://vega.github.io/editor/#/gist/5e2141d8e73ea6acd96812bda364ed1d/vl5_species.json
@mattijn looking quite good
Would you be able to take a look at this PR? There appears to be an abandoned PR that would fix geoshape facet. Would be great if we could have this example working with
Agree it would be better, but unfortunately I'm not confident with TypeScript.
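For reference, a sketch of the facet-channel spelling this would hopefully allow, modelled on the income-by-state example at the top of the thread and on the species_counties.csv columns above; I haven't verified how it behaves against the open upstream issue:

```python
import altair as alt
import pandas as pd
from vega_datasets import data

alt.data_transformers.disable_max_rows()

counties = alt.topo_feature(data.us_10m.url, 'counties')
df = pd.read_csv('species_counties.csv')  # produced by the processing snippet above

alt.Chart(df).mark_geoshape().encode(
    shape=alt.Shape(field='geo', type='geojson'),
    color=alt.Color('pct:Q', scale=alt.Scale(scheme='inferno')),
    facet=alt.Facet('species:N', columns=2)
).transform_lookup(
    lookup='county_id',
    from_=alt.LookupData(data=counties, key='id'),
    as_='geo'
).project(
    type='albersUsa'
).properties(
    width=180,
    height=130
)
```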
@mattijn Would this arrow file suffice? Do we want to change up the species before setting this for posterity?

Below is a sample visualization (and script) to see if the species selection + dataset format is ok. (We can enhance the tooltip in the actual example.)

Also, would there be a straightforward way to spot check a single county's calculation of the pct field? If so I'd like to include this in vega-datasets#684 to highlight the accuracy of the technique used.

Species Habitat Visualization Code

```python
import altair as alt
import pyarrow as pa
import requests
import io
from vega_datasets import data

# Download and load the Arrow file
url = "https://github.com/vega/vega-datasets/raw/c4ff52d22bfa2e70a1b40554cc4c56df216838f3/data/species.arrow"
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
table = pa.ipc.RecordBatchFileReader(io.BytesIO(response.content)).read_all()

# Disable row limit for Altair (if using vegafusion, this may not be needed)
alt.data_transformers.disable_max_rows()

# Load US counties topology
counties = alt.topo_feature(data.us_10m.url, 'counties')

# Create a chart for each unique species (limiting to first 4 for demonstration)
# With pyarrow, we can use filter directly on the table.
species_list = table.column('CommonName').to_pylist()
species_list = list(dict.fromkeys(species_list))[:4]  # Get unique species and limit

charts = [
    alt.Chart(counties, title=species)
    .mark_geoshape(tooltip=True)
    .encode(
        color=alt.Color(
            'percent_habitat:Q',
            scale=alt.Scale(
                domain=[0, 1],
                scheme='viridis',
                zero=True,
                nice=False
            ),
            title='Habitat Percentage',
            legend=alt.Legend(format=".0%")  # Format within the Legend object
        ),
        tooltip=[
            alt.Tooltip('county_id:N', title='County ID'),
            alt.Tooltip('percent_habitat:Q', title='Habitat %', format='.2%')
        ]
    )
    .transform_lookup(
        lookup='id',
        from_=alt.LookupData(
            # Filter the pyarrow table directly, no need for to_pandas()
            data=table.filter(pa.compute.equal(table.column('CommonName'), species)),
            key='county_id',
            fields=['percent_habitat']
        )
    )
    .transform_calculate(
        percent_habitat="datum.percent_habitat === null ? 0 : datum.percent_habitat"
    )
    .project(type='albersUsa')
    .properties(width=300, height=200)
    for species in species_list
]

# Combine charts into a grid
chart = alt.concat(*charts, columns=2).configure_view(
    stroke=None
).configure_mark(
    invalid='filter'
)

# Display the chart
chart
```
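On the spot-check question, one low-tech sketch would be to rerun exact_extract for a single county geometry and compare its 'year-round' fraction against percent_habitat in the Arrow file. This assumes the source GeoTIFF and counties TopoJSON are available locally, and the county id shown is a hypothetical example that may need adjusting to the id dtype in your data:

```python
import geopandas as gpd
from exactextract import exact_extract

raster_fp = "bAMROx_CONUS_HabMap_2001v1.tif"
county_id = 6037  # hypothetical example (Los Angeles County); adjust to match your id values

# Same county preparation as in the earlier snippets
gdf = gpd.read_file("us10m.topo.json", layer="counties")
gdf = gdf.set_crs(epsg=4326).to_crs(epsg=5070)
gdf = gdf[~gdf.is_empty]
gdf.columns = ["county_id", "geometry"]

# Recompute the fraction for just one county
one = gdf[gdf["county_id"] == county_id]
res = exact_extract(
    rast=raster_fp,
    vec=one,
    ops=["unique(default_value=255)", "frac(default_value=255)"],
    include_cols="county_id",
    output="pandas",
)
res = res.explode(["unique", "frac"])
print(res[res["unique"] == 3])  # compare 'frac' with the dataset's percent_habitat for this county
```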
This commit introduces a new example to the Altair documentation showcasing choropleth maps faceted by category. The example visualizes the distribution of suitable habitat for different species across US counties, using the proposed new Species Habitat dataset from vega-datasets (vega/vega-datasets#684).

Key features of this example:
- Demonstrates the use of `alt.Chart.mark_geoshape()` for geographical visualizations.
- Shows how to create faceted maps for comparing categorical data across geographic regions.
- Utilizes the `transform_lookup` and `transform_calculate` transforms for data manipulation within Altair.
- Uses a CSV data file temporarily hosted in the vega-datasets repository branch (pending dataset merge).

This example addresses issue #1711, which requested a faceted map example for the Altair documentation.

Co-authored-by: Mattijn van Hoek <mattijn@gmail.com>
Tasks

Based on #1711 (comment), #1711 (comment)

Tip: The tasks below should be completed in order of appearance. Earlier tasks can be performed by any contributor (new or old) familiar with `alt.Chart.mark_geoshape`.

- Example (not `altair`) (#1711 (comment)) @mattijn

Maybe something like species richness for a limited number of species. Similar to below, from here, but then preferably aggregated per county, instead of raster cells: