Spatial Harvesting Cleanup #5084

Open · 1 task · jbrown-xentity opened this issue Feb 12, 2025 · 4 comments
Labels: H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0

@jbrown-xentity (Contributor) commented Feb 12, 2025:

User Story

In order to support spatial search on CKAN and in harvester 2.0, data.gov admins want the spatial field filled out according to the ckanext-spatial spec.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • GIVEN a DCAT-US file with a valid spatial field is harvested
    WHEN the object is prepared for ingestion into CKAN
    THEN the old-spatial field holds the raw input AND valid spatial metadata is filled out.
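
For illustration only, the criterion above amounts to something like the following for a single dataset dict (Python; the field names follow the issue text, the bounding-box value is made up, and the helper that performs the translation is not shown):

# Before translation: the raw DCAT-US value sits in "spatial" (illustrative values).
before = {
    "title": "Example dataset",
    "spatial": "-124.733253,24.544245,-66.954811,49.388611",
}

# After translation: the raw input is preserved in "old-spatial", and "spatial"
# holds a stringified GeoJSON geometry, which is what ckanext-spatial expects.
after = {
    "title": "Example dataset",
    "old-spatial": "-124.733253,24.544245,-66.954811,49.388611",
    "spatial": (
        '{"type": "Polygon", "coordinates": [[[-124.733253, 24.544245], '
        '[-124.733253, 49.388611], [-66.954811, 49.388611], '
        '[-66.954811, 24.544245], [-124.733253, 24.544245]]]}'
    ),
}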

Background

This was done as a pre-processing step in the old CKAN system: https://github.com/GSA/ckanext-geodatagov/blob/main/ckanext/geodatagov/logic.py#L415-L538

Security Considerations (required)

None

Sketch

Major pieces:

  • Download and set up an updated places database as a custom table in the harvester DB (the old one is in the catalog setup, but we should get the latest from geonames; we know this is sparse, details TBD). A download sketch follows this list.
  • Walk through the use case in the Background link.
  • Develop a function that first tries to parse and pull out the spatial bounding box (a rough sketch of this function follows below):
    • If valid GeoJSON, leave as is (stringified JSON).
    • If a point or a simple lat/long list, create a point GeoJSON object.
    • Otherwise, hit the database to see if there is a matching entry.
    • Leave a TODO for xml/gml and anything else unexpected: write an entry to the log, keep the raw input in old-spatial, and leave the spatial field blank.
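
A throwaway sketch of the download step, assuming the standard geonames postal-code dump at https://download.geonames.org/export/zip/US.zip (the file name and layout should be confirmed against the geonames readme):

import io
import urllib.request
import zipfile

GEONAMES_URL = "https://download.geonames.org/export/zip/US.zip"

# Fetch the zip archive into memory and extract the tab-delimited US.txt,
# which can then be loaded into the locations table (see the DDL comment below).
with urllib.request.urlopen(GEONAMES_URL) as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))
archive.extract("US.txt", path=".")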

Make sure unit tests exist for all use cases, and that it is easy to confirm what the outputs for given inputs will be.
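
A rough Python sketch of that parsing function, not the actual implementation: translate_spatial and lookup_place are hypothetical names, and the lon/lat ordering plus the four-number bounding-box branch are assumptions to confirm against the old geodatagov logic linked in the Background:

import json
import logging

logger = logging.getLogger(__name__)


def translate_spatial(raw_value, lookup_place=None):
    """Return (old_spatial, spatial) for a raw DCAT-US spatial value.

    lookup_place is an optional callable that maps a place name to a
    GeoJSON string via the locations table (see the DDL in the comments).
    """
    old_spatial = raw_value
    if not raw_value:
        return old_spatial, ""

    # 1. Already valid GeoJSON: leave as is (stringified JSON).
    try:
        parsed = json.loads(raw_value)
    except (ValueError, TypeError):
        parsed = None
    if isinstance(parsed, dict) and "type" in parsed:
        return old_spatial, json.dumps(parsed)

    # 2. Comma-separated numbers: 2 -> point, 4 -> bounding-box polygon.
    #    (Assumes lon/lat and W,S,E,N ordering; confirm against real data.)
    try:
        nums = [float(p) for p in raw_value.split(",")]
    except ValueError:
        nums = []
    if len(nums) == 2:
        lon, lat = nums
        return old_spatial, json.dumps({"type": "Point", "coordinates": [lon, lat]})
    if len(nums) == 4:
        minx, miny, maxx, maxy = nums
        ring = [[minx, miny], [minx, maxy], [maxx, maxy], [maxx, miny], [minx, miny]]
        return old_spatial, json.dumps({"type": "Polygon", "coordinates": [ring]})

    # 3. Free-text place name: try the locations table.
    if lookup_place is not None:
        match = lookup_place(raw_value)
        if match:
            return old_spatial, match

    # 4. TODO: xml/gml and anything else unexpected. Log it, keep the raw
    #    input in old-spatial, and leave spatial blank.
    logger.warning("Unrecognized spatial value, leaving spatial blank: %r", raw_value)
    return old_spatial, ""

Each branch maps to one of the use cases above, so each can get its own unit test asserting the (old-spatial, spatial) pair for a known input.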

@jbrown-xentity jbrown-xentity added the H2.0/Harvest-Runner Harvest Source Processing for Harvesting 2.0 label Feb 12, 2025
@btylerburton btylerburton moved this to 📥 Queue in data.gov team board Feb 13, 2025
@rshewitt rshewitt moved this from 📥 Queue to 🏗 In Progress [8] in data.gov team board Feb 26, 2025
@rshewitt rshewitt self-assigned this Feb 26, 2025
@rshewitt (Contributor) commented:

The locations table data used in the catalog setup appears to be the US postal data on geonames. This is a smaller dataset compared to the gazetteer dataset, but considering how infrequently geonames are used on catalog, and in an effort not to bog down the DB with more records than we need, it seems like continuing to use this dataset (still getting the latest version) makes sense.

@rshewitt (Contributor) commented:

DDL for the "locations" table. This reflects what's in the postal code documentation. I changed the accuracy field to varchar to make the load from the txt file easier; we're dropping it anyway.

CREATE TABLE locations (
  country_code char(2),
  postal_code varchar(20),
  place_name varchar(180),
  admin_name1 varchar(100),
  admin_code1 varchar(20),
  admin_name2 varchar(100),
  admin_code2 varchar(20),
  admin_name3 varchar(100),
  admin_code3 varchar(20),
  latitude float,
  longitude float,
  accuracy varchar(20)
);

-- load the data using \copy in psql
-- \copy locations from 'US.txt' with delimiter E'\t'

-- drop the columns we don't need
ALTER TABLE locations 
DROP COLUMN country_code,
DROP COLUMN postal_code,
DROP COLUMN admin_name1,
DROP COLUMN admin_code1,
DROP COLUMN admin_name2,
DROP COLUMN admin_code2,
DROP COLUMN admin_name3,
DROP COLUMN admin_code3,
DROP COLUMN accuracy;

-- convert lat/lon to point geometry
ALTER TABLE locations ADD COLUMN geom geometry(Point, 4326);
UPDATE locations SET geom = ST_SetSRID(ST_MakePoint(longitude, latitude), 4326);

ALTER TABLE locations
DROP COLUMN latitude,
DROP COLUMN longitude;

mydb=# select * from locations limit 1;
 place_name |                        geom                        
------------+----------------------------------------------------
 APO AE     | 0101000020E6100000B459F5B9DA8AFFBFA301BC0512DC4940
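
For reference, a minimal sketch of the place-name lookup against this table, assuming psycopg2 and PostGIS (the connection string, function name, and exact-match rule are illustrative only):

import psycopg2


def lookup_place_geojson(place_name, dsn="dbname=harvester"):
    """Return the geometry for a place name as a GeoJSON string, or None."""
    query = """
        SELECT ST_AsGeoJSON(geom)
        FROM locations
        WHERE lower(place_name) = lower(%s)
        LIMIT 1
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(query, (place_name,))
        row = cur.fetchone()
    return row[0] if row else None

# e.g. lookup_place_geojson("APO AE") would return a Point GeoJSON string
# for whatever coordinates are loaded for that row.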

@rshewitt (Contributor) commented:

It occurred to me that the free geonames datasets are all in point form, but the records in the locations table are polygons. Apparently the dataset we use existed before @FuhuXia. We can continue to use that in the meantime until we find a more recent version of it.

@tdlowden (Member) commented:

Some work will be done by tomorrow.

Project status: 🏗 In Progress [8]