code for generating training data used for fine-tuning the LLM

hms-dbmi/udi-training-data

udi-training-data

This repository contains code for generating training data consisting of natural language prompts and responses in the form of a Universal Discovery Interface (UDI) specification.

The overall pipeline can be run from main.ipynb and consists of five high-level steps:

  1. Template Generation will create questions and specifications with placeholders for entities and fields as well as constraints for those entities/fields.
  2. Schema Generation will create dataset schemas based on provided datasets.
  3. Template Expansion will expand the template questions/specifications against the provided schemas, producing every combination of entities and fields that satisfies the constraints.
  4. Paraphraser will use an LLM framework to paraphrase input questions to cover different levels of expertise and formality in the input. Note: this step is currently a placeholder and is not yet implemented.
  5. Export will write the data as a list of data points. Each data point has the following attributes:
  • query_template: The original query template with entity and field placeholders.
  • constraints: The list of constraints that data entities and fields must satisfy.
  • spec_template: The template UDI specification with entity and field placeholders.
  • query_type: The form of the query, either question or utterance.
  • creation_method: Always 'template' for data points produced by this script.
  • query_base: The query with placeholders filled in, e.g. 'donors' and 'sex' instead of the entity and field placeholders.
  • spec: The UDI specification with entities and fields filled in.
  • dataset_schema: The dataset schema name.
  • query: The paraphrased version of query_base.
  • expertise: The expertise score [1-5] of the paraphrased query.
  • formality: The formality score [1-5] of the paraphrased query.
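
The expansion and export steps above can be illustrated with a minimal sketch. It is not the repository's implementation: the placeholder tokens `[E]` and `[F]`, the schema layout, the constraint string, the spec keys, and the `expand` function are all hypothetical and chosen only for illustration. Because the paraphraser is a placeholder, the sketch assumes `query` is copied from `query_base` and leaves the scores unset.

```python
import json

# Hypothetical placeholder tokens; the repository's actual tokens may differ.
ENTITY, FIELD = "[E]", "[F]"

# Hypothetical schema layout: entity name -> {field name: field type}.
schema = {
    "name": "example_schema",
    "entities": {
        "donors": {"sex": "nominal", "age": "quantitative"},
        "samples": {"organ": "nominal"},
    },
}

# Hypothetical constraint vocabulary: constraint string -> predicate.
CHECKS = {
    "field is nominal": lambda entity, field, ftype: ftype == "nominal",
}

template = {
    "query_template": "How many [E] are there, grouped by [F]?",
    "constraints": ["field is nominal"],
    # Illustrative keys only, not the real UDI grammar.
    "spec_template": {"source": "[E]", "groupby": "[F]"},
    "query_type": "question",
}


def expand(template, schema):
    """Return one data point per (entity, field) pair that satisfies all constraints."""
    points = []
    for entity, fields in schema["entities"].items():
        for field, ftype in fields.items():
            if not all(CHECKS[c](entity, field, ftype) for c in template["constraints"]):
                continue

            def fill(text):
                # Substitute the concrete entity and field for the placeholders.
                return text.replace(ENTITY, entity).replace(FIELD, field)

            query_base = fill(template["query_template"])
            points.append({
                **template,
                "creation_method": "template",
                "query_base": query_base,
                # Round-trip through JSON so nested spec values are filled too.
                "spec": json.loads(fill(json.dumps(template["spec_template"]))),
                "dataset_schema": schema["name"],
                # Assumption: no paraphrasing yet, so query mirrors query_base.
                "query": query_base,
                "expertise": None,
                "formality": None,
            })
    return points


data_points = expand(template, schema)
print(json.dumps(data_points[0], indent=2))
```

With this toy schema, only the nominal fields pass the constraint, so the template expands to two data points: one for donors and sex, and one for samples and organ.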
