udi-training-data

This repository contains code for generating training data consisting of natural language prompts and responses in the form of a Universal Discovery Interface (UDI) specification.

The overall pipeline can be run from main.ipynb and consists of the following high-level steps.

Template Generation creates questions and specifications with placeholders for entities and fields, along with constraints on those entities and fields.
Schema Generation creates dataset schemas from the provided datasets.
Template Expansion expands the template questions and specifications against the provided schemas, producing every combination that satisfies the constraints.
Paraphraser uses an LLM framework to paraphrase input questions so that they cover different levels of expertise and formality. Note: this is currently a placeholder and is not yet implemented.
Export writes the data as a list of data points. Each data point has the following attributes.

query_template: The original query template with entity and field placeholders.
constraints: The list of constraints that data entities and fields must satisfy.
spec_template: The template UDI specification with entity and field placeholders.
query_type: The form of the query, either question or utterance.
creation_method: Always template for data points produced by this script.
query_base: The query with the placeholders filled in, e.g. 'donors' and 'sex' in place of the entity and field placeholders.
spec: The UDI specification with entities and fields filled in.
dataset_schema: The dataset schema name.
query: The paraphrased version of query_base.
expertise: The expertise score [1-5] of the paraphrased query.
formality: The formality score [1-5] of the paraphrased query.
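
The expansion step and the exported attributes can be illustrated with a minimal sketch. This is not the repository's implementation: the schema layout, the `$entity` and `$field` placeholder syntax, the constraint string format, the toy spec fields, and the helper names `satisfies` and `expand` are all assumptions made for illustration. Since the paraphraser is described as a placeholder, the sketch copies query_base into query and leaves the scores unset.

```python
import json
from string import Template

# Hypothetical schema: entity name -> {field name -> field type}.
SCHEMA = {
    "name": "demo_schema",
    "entities": {
        "donors": {"sex": "nominal", "age": "quantitative"},
        "samples": {"organ": "nominal"},
    },
}

# Hypothetical template. $entity and $field stand in for the repository's
# placeholders, and the spec is a toy JSON shape, not real UDI grammar.
TEMPLATE = {
    "query_template": "How many $entity are there for each $field?",
    "spec_template": '{"source": "$entity", "groupby": "$field", "aggregate": "count"}',
    "constraints": ["field_type == 'nominal'"],
    "query_type": "question",
}


def satisfies(constraint, field_type):
    """Check one toy constraint of the form "field_type == '<type>'"."""
    left, op, right = (part.strip() for part in constraint.partition("=="))
    if left != "field_type" or op != "==":
        raise ValueError(f"unsupported constraint: {constraint}")
    return field_type == right.strip("'")


def expand(template, schema):
    """Yield one data point per (entity, field) pair that meets every constraint."""
    for entity, fields in schema["entities"].items():
        for field, field_type in fields.items():
            if not all(satisfies(c, field_type) for c in template["constraints"]):
                continue
            values = {"entity": entity, "field": field}
            query_base = Template(template["query_template"]).substitute(values)
            yield {
                **template,
                "creation_method": "template",
                "query_base": query_base,
                "spec": Template(template["spec_template"]).substitute(values),
                "dataset_schema": schema["name"],
                # No paraphrasing in this sketch: query mirrors query_base
                # and the scores are left unset.
                "query": query_base,
                "expertise": None,
                "formality": None,
            }


data_points = list(expand(TEMPLATE, SCHEMA))
print(json.dumps(data_points[0], indent=2))
```

With this toy schema, only the nominal fields pass the constraint, so the sketch yields two data points: one for donors and sex, and one for samples and organ.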

