Synthesis of user research when using configuration in Kedro #891
I find this very interesting. A lot of the teams I've seen love/have gotten in the habit of creating a physical dataset for everything. I've gotten feedback when I've implicitly left something as a `MemoryDataSet`.

Is it possible to also see how often people read back datasets (after a pipeline run), especially intermediate datasets? We don't care at all about storage/cost (and apparently performance), so we turn on versioning and write every dataset to disk. How often do we look at them? I'm guessing at the end of a bigger client project you've got 100K-1M+ datasets on disk, and people have only ever looked at <1K. This need to write every dataset to disk manifests itself when you're reusing modular pipelines and want to generate catalog entries for each instance.

I guess what I'm getting at is: maybe the configuration is fine, and a lot of the teams I've seen are way overdoing their catalogs. 😂
Not a fan of pattern matching; will copy my comments to @hamzaoza here:
Sidebar: Just give me autoformatting with ...
Interesting analysis! I personally like the Jinja2 template system. The only problem I see with it is readability.
Thanks for this amazing piece of work @hamzaoza - I'm also quite impressed with how dbt works with Jinja, where they have concise SQL models at rest, but the compiled, fully materialised SQL is available for debugging. Perhaps the same approach could be used to allow people to write concise, complex catalogs - but allow users to materialise them in the format that Kedro sees at runtime?
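A minimal sketch of that dbt-style "compile" step, using plain Jinja2 outside Kedro. The file name, dataset names and the `countries` variable are all illustrative, not part of any proposed Kedro API:

```python
from pathlib import Path

from jinja2 import Template

# Concise catalog at rest: one templated entry, expanded per country.
template = Template(
    """\
{% for country in countries %}
reviews_{{ country }}:
  type: pandas.CSVDataSet
  filepath: data/01_raw/{{ country }}/reviews.csv
{% endfor %}"""
)

# "Compile" step: materialise the fully expanded YAML so it can be
# inspected and debugged, much like dbt's compiled SQL.
compiled = template.render(countries=["uk", "de", "fr"])
Path("catalog_compiled.yml").write_text(compiled)
print(compiled)
```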
I've tried to consolidate my thinking and would like to present 4 prototypes for ways we can take this research forward and turn it into features. Please comment, interact and react to the below:
### [Prototype 1] Robust support for environment variables

**Abstract**

We have multiple examples of people making a trivial change to their config loader so that environment variables are pulled into scope. The following order of precedence would apply here:

**User need justification**

**Implementation today**

Today one can introduce environment variables into their templated configuration like so:

```python
import os
from typing import Iterable

from kedro.config import ConfigLoader, TemplatedConfigLoader
from kedro.framework.hooks import hook_impl


@hook_impl
def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
    return TemplatedConfigLoader(
        conf_paths,
        globals_pattern="*globals.yml",
        # os.environ must be iterated with .items() to get key/value pairs
        globals_dict={k: v for k, v in os.environ.items() if k.startswith("KEDRO_")},
    )
```
#### Proposal 1.1 - Environment variable pattern matching

In this proposal we introduce a new keyword argument that allows the user to specify a regular expression that matches environment variables and includes them in scope. Many of our users already do something like this themselves; this simply provides a convenience for doing so.

```python
@hook_impl
def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
    return TemplatedConfigLoader(
        conf_paths,
        globals_pattern="*globals.yml",
        env_var_key_pattern=["^KEDRO.+"],  # regex pattern
    )
```
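For illustration, here is a minimal sketch of what such pattern matching could do under the hood. `collect_env_globals` is a hypothetical helper, not part of the proposal's API:

```python
import os
import re


def collect_env_globals(patterns):
    """Return environment variables whose names match any of the regex patterns."""
    compiled = [re.compile(p) for p in patterns]
    return {
        key: value
        for key, value in os.environ.items()
        if any(rx.match(key) for rx in compiled)
    }


# With KEDRO_BASE_PATH=s3://my-bucket set in the shell:
# collect_env_globals(["^KEDRO.+"]) -> {"KEDRO_BASE_PATH": "s3://my-bucket"}
```

The resulting dictionary would then presumably be merged with the globals provided by `*globals.yml`, ready to be referenced as `${KEDRO_BASE_PATH}` in catalog entries.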
#### Proposal 1.2 - Extending Proposal 1 with specific features for credential management

A related area that people bring up is the way Kedro handles credential management, particularly since the enterprise world has become more sophisticated in the last couple of years with vendor-led solutions such as HashiCorp Vault and Kerberos. Environment variables have a part to play, and perhaps the following change could help in the same sort of way:

```python
@hook_impl
def register_config_loader(self, conf_paths: Iterable[str]) -> ConfigLoader:
    return TemplatedConfigLoader(
        conf_paths,
        globals_pattern="*globals.yml",
        env_var_credential_pattern=["^AWS.+ACCESS_KEY.*"],  # regex pattern
        env_var_mapping={
            # optional mechanism to rename certain keys
            "AWS_ACCESS_KEY_ID": "s3_token",
            "AWS_SECRET_ACCESS_KEY": "secret_mapping",
        },
    )
```
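A rough sketch of how the rename mapping could behave, in the same hypothetical-helper style as above:

```python
import os
import re


def map_env_credentials(pattern, mapping):
    """Collect env vars matching `pattern`, renaming keys via the optional mapping."""
    rx = re.compile(pattern)
    found = {k: v for k, v in os.environ.items() if rx.match(k)}
    return {mapping.get(key, key): value for key, value in found.items()}


# With AWS_ACCESS_KEY_ID set in the shell, the credential surfaces as "s3_token":
creds = map_env_credentials(
    "^AWS.+ACCESS_KEY.*",
    {"AWS_ACCESS_KEY_ID": "s3_token", "AWS_SECRET_ACCESS_KEY": "secret_mapping"},
)
```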
**Closing thoughts**
### [Prototype 2] Support dynamically generated configuration

**Abstract**

It is clear from supporting the product that users want to define their pipelines dynamically. This is partly because we as programmers want to follow the DRY ("Don't repeat yourself") principle, and partly because large Kedro projects end up with the user duplicating lots of configuration (configuration environments are one example where this is unavoidable).

Our thinking on this has been heavily influenced by the Google SRE cookbook, which outlines the exact journey we as a Kedro team have gone on:
The solution that we land on will ultimately fall into category 3, 4 or 6. Traditionally, the Kedro team has been resistant to going in this direction because it inevitably makes configuration less readable and more difficult for newcomers to understand. Perhaps this proposal, when combined with [Idea 3] (adding a compilation step), mitigates this readability point.

**User need and justification**

Since introducing Jinja2 support to the `TemplatedConfigLoader`...

#### Proposal 2.1 - Jinja2
Currently we do not support one key Jinja2 feature which would improve the developer experience: importing and reusing macros via `{% import %}`.
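To make that concrete, here is a self-contained sketch of macro reuse with plain Jinja2 (the macro name, dataset names and paths are illustrative, not a proposed Kedro API):

```python
from jinja2 import DictLoader, Environment

# A shared macro file and a catalog template that imports it.
templates = {
    "macros.j2": (
        "{% macro csv(name, path) -%}\n"
        "{{ name }}:\n"
        "  type: pandas.CSVDataSet\n"
        "  filepath: {{ path }}\n"
        "{%- endmacro %}"
    ),
    "catalog.yml.j2": (
        "{% import 'macros.j2' as m %}\n"
        "{{ m.csv('companies', 'data/01_raw/companies.csv') }}\n"
        "{{ m.csv('shuttles', 'data/01_raw/shuttles.csv') }}"
    ),
}

env = Environment(loader=DictLoader(templates))
print(env.get_template("catalog.yml.j2").render())
```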
In general the main arguments against going down this route are:
#### Proposal 2.2 - Pattern matching
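The proposal details are not preserved above, but based on the pattern-matching option participants were shown elsewhere in this research, the core idea is that catalog entries can be resolved from a dataset's name rather than declared one by one. A minimal sketch of that idea (the `_csv` naming convention, paths and helper are illustrative):

```python
import re

# Hypothetical catalog pattern: any dataset named "<something>_csv" resolves
# to a CSV entry whose filepath is derived from the dataset name.
PATTERN = re.compile(r"^(?P<name>.+)_csv$")


def resolve(dataset_name):
    match = PATTERN.match(dataset_name)
    if not match:
        return None  # fall back to explicit catalog entries
    return {
        "type": "pandas.CSVDataSet",
        "filepath": f"data/01_raw/{match.group('name')}.csv",
    }


print(resolve("companies_csv"))
# {'type': 'pandas.CSVDataSet', 'filepath': 'data/01_raw/companies.csv'}
```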
Negative feedback from this proposal focused on:
#### Proposal 2.3 - Jsonnet
**Closing thoughts**

A core principle in Kedro is that there should be only one obvious way of doing things - this means that all of the proposals above are mutually exclusive. This is a high priority for Kedro, but it's a big decision choosing which horse to back.
### [Prototype 4] A more consistent and robust mechanism for providing configuration overrides via the CLI

**Abstract**

The run arguments available via the Kedro CLI have evolved organically to date. Today there are 3 mechanisms for injecting some configuration into vanilla Kedro via the CLI:
**User need justification**

The request to provide CLI overrides comes up relatively frequently (examples 1, 2, 3), as well as in multiple references on internally facing channels. Much of this stems from a desire to separate the business logic (nodes, pipelines, models) from the inputs and outputs (catalog + credentials, parameters). Kedro provides a separation of these concerns, but they are still situated within the codebase on the user's file system. This separation becomes a higher priority in production deployments for several key reasons.
The proposals in this post will follow the following order of precedence, with CLI overrides taking the highest priority before dropping down to the other levels.

#### Proposal 4.1 - Allow users to point to a zipped version of the `conf/` directory
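A sketch of how such a conf source could be resolved, accepting either a plain directory or an archive. `resolve_conf_source` is a hypothetical helper for illustration, not Kedro API:

```python
import tarfile
import tempfile
from pathlib import Path


def resolve_conf_source(source: str) -> Path:
    """Accept either a config directory or an archived copy of one."""
    path = Path(source)
    if path.is_dir():
        return path
    if tarfile.is_tarfile(str(path)):
        # Unpack the archive into a temp dir and use that as the conf root.
        target = Path(tempfile.mkdtemp())
        with tarfile.open(path) as archive:
            archive.extractall(target)
        return target
    raise ValueError(f"Unsupported conf source: {source}")


# Usage sketch: resolve_conf_source("conf/") or resolve_conf_source("conf.tar.gz")
```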
Actually @datajoely we have a ...
Not Jinja, but Helm charts use templated configuration on YAML targets and are considered an industry standard: https://helm.sh/docs/chart_template_guide/functions_and_pipelines/. You often see things like: ...
@mzjp2 - Yeah, spotted that. In general the whitespacing issue with Jinja isn't a dealbreaker - it just highlights that you need a series of hacks (or yet another DSL, in this case) to make it work well for YAML.
@datajoely I am not sure I am a big fan of this syntax, it looks quite strange and not very user-friendly... When I was saying that we can specify where to get the config from, I meant something as simple as providing the folder manually. E.g. currently you can run your project in three ways:
In all of those, there's a hard assumption that your current directory contains the config. What I would like is to make that assumption rather a soft one, i.e. the user should be able to specify where the config is located by pointing to a folder or a .tar.gz file. Something like this:
Why is that functionality useful? This can help a lot with deploying configuration and packaging. E.g. when you do ...
@idanov what would you imagine the syntax to look like instead? For reference, Proposal 4.1 would make it look like this: ...
Community PR #927 further suggests that more complex CLI override facilities (Proposal 4) are desired.
For the record, my team has extended the CLI to add this option and this is exactly how we deploy our applications. The option is called ...
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Was led here after trying to solve my own problem via the current Jinja2 implementation in #1532. My thought is: declarative is good, but a fairly sophisticated solution is needed to cover all the use cases. I'm a big fan of Terraform - even if it has some idiosyncrasies. But it does have variables, loops, etc. - it goes well beyond simple pattern matching. Pattern matching can be stretched, but when you stretch it, it also gets complex and hard to read.
(See devspace for a YAML solution that is extremely flexible - variables, profiles with pattern replacement, and use of Helm templates if need be. No loops though. Or Helm itself.)
Afterthought: You could probably write a Terraform extension that allows declaration in Terraform syntax and maintains correspondence with local YAML files. This requires a complex outside dependency, but it might be the simplest path to a sophisticated declarative way to do configuration.
Configuration has had a complete overhaul with the new ...
For reference, in 0.19 we have ...
I think we could go further here @noklam but there is some overlap, yes
Summary

Configuration overhead is an issue that has arisen time and time again from user feedback - particularly as Kedro projects scale in complexity. From the user interviews, the three main configurations used were for `kedro run`, the Data Catalog and parameters. The remaining options were seen as "set up once and forgotten" for the remainder of the project. Overall, configuration in Kedro is well received and liked by users, who appreciate the approach Kedro has taken so far.

During this research, it became clear that configuration scaling impacts a small set of use cases where you have multiple environments (e.g. `dev`, `staging` and `prod`) and multiple use cases - maybe you're using the same or a similar pipeline across different products for different countries. To gather deeper insights, participants were presented with two existing options for the Data Catalog and two possible solutions: pattern matching and Jinja templating (favouring the former of the two). Users were also asked about their feelings on moving the Data Catalog entirely to Python. Participants were universally against the idea, as it would fundamentally go against the principles of Kedro.
1. Introduction
Configuration overhead is an issue that has arisen time and time again from user feedback - particularly as Kedro projects scale in complexity. It's also an issue for new users who have never been exposed to this concept, i.e. Data Scientists using software engineering principles for the first time. This research aims to understand the key pain points users face when using configuration, and to test possible solutions for the Data Catalog in order to develop a set of criteria for any solution.
2. Background
Kedro is influenced by the 12 Factor App, but this results in a lot of duplication of configuration. From users, we have heard that YAML files can become unwieldy, with each entry written manually, making them error-prone. Users also want to apply runtime parameters and parameterise runs in complex ways, which Kedro doesn't currently support.
As a result, some teams have tried to solve this independently - most notably by using Jinja2 templating through the `TemplatedConfigLoader` - though this has not become widespread across other teams. However, as we continue to grow, it is likely that more users will encounter similar issues and will need a Kedro-native solution to support that growth.
Finally, this is not a problem unique to Kedro. Google SREs have faced a similar issue in the past and have outlined their thoughts and experiences here.
3. Research Approach
To develop a holistic overview of configuration in Kedro, a journalistic approach (who, what, when, where, why and how) was used. We were looking to answer the following questions:
Note: There is some overlap in the last two questions.
Research Scope
To help keep things manageable, the primary focus of this research was on the Data Catalog and how users interact with it. Nonetheless, pain points for other forms of configuration in Kedro were also captured and will be discussed later. Therefore, elements like parameters, credentials, etc. were not explicitly user tested. Furthermore, custom solutions created by teams may be referenced but will not be considered in the overall solution as they are not Kedro native features.
4. User Interview Matrix
In total, 19 interviews (lasting 1 hour each) across personas and experience levels were conducted to capture a spectrum of views. The user matrix breakdown is shown below.
Note: External users were sourced from Kedro Discord
5. Configuration Synthesis
The synthesis covered configuration for `kedro run`, hooks, credentials, configuration environments, parameters and the Data Catalog. For each area, the interviews asked the following questions.

**What technology is currently used to support this configuration?**

- Jinja

**Where in the Kedro project can the user make this configuration?**

- `src/<project-package>/settings.py`
- `pyproject.toml`
- `kedro run --config **.yml`
- `export KEDRO_ENV=xyz`
- `src/<project-name>/hooks.py`
- `conf/**/credentials.yml`
- `conf/base/**.yml`
- `conf/local/**.yml`
- `conf/**/**.yml`
- `export KEDRO_ENV=**`
- `conf/**/parameters.yml`
- `kedro run --params param_key1:value1,param_key2:2.0`
- `kedro run --config **.yml`
- `conf/**/catalog.yml`

**Who is the lead user responsible for this configuration?**

**How does the user feel about this approach?**

**What do users like about this approach?**

- Standard project structure
- Easy to collaborate with others
- Provides great defaults out of the box
- Easy to ramp up a Kedro project
- Can use the `--pipeline` flag to run specific branches of code
- Can git commit a `config.yml` file to reduce run errors
- Overall, one of the easiest things to work with
- Enables automation and scaling of Kedro
- Easy to collaborate with others
- A properly written hook can save lots of time
- Works as it should and is seamless
- Can handle a variety of credentials out of the box
- Each person can have their own setup to access data
- Enables a structured approach to dev/qa/prod
- `globals.yml` can be different for each environment
- Decouples code and config
- Helps teams test and prototype in environments in a risk-free way
- Easy and straightforward to use
- Easy to read and maintain
- Like the `params:` prefix to quickly identify parameters in code
- Declarative syntax makes it easy to use, read and debug
- Simplification of I/O
- Decouples code and I/O
- Already has many data connectors built in
- Transcoding datasets

**What are the pain points of this configuration?**

- Running into issues with `kedro install` on Windows
- Changes to hooks and the pipeline registry between versions
- Arguments in the terminal are not version controlled
- `--nodes` is `node_names` in the YAML file
- You need some knowledge to set up - not easy for beginners
- Can reduce transparency of code
- Users might have the idea, but they don't always find it easy to implement
- Jinja was not well received by clients
- For beginners, it can be a little hard to grasp why credentials are separated from the Data Catalog or code
- Feels misaligned with CI/CD tooling
- The inheritance pattern of local / custom / base can be hard for new users to pick up
- Parameters not inheriting base keys, so you need to overwrite the entire entry
- Repetition and duplication of files
- Can grow to large files, leading to a very nested dictionary
- Cannot have ranges or step increments
- Little IDE support means you need to follow the logic yourself
- Duplication of files
- Minor changes to entries need to be applied everywhere - can be difficult to sync
- Not easy to write a custom class for unsupported datasets
- For some teams, YAML anchors are beyond their skillset
- Very long catalog files

**What new features are users requesting to support their work?**

- More documentation for migrations with breaking changes
- Hooks for when a model starts and ends
- Have nested dependencies in `globals.yml`
- Enable flexible inheritance across environments
- Provision to separate use cases and environments
- Implement namespaces for parameters
- More dynamic entries, e.g. ranges
- Address the repetition and duplication of catalogs
- More guidance on picking the best datatype for an entry
- Support more upcoming datasets, e.g. TensorFlow
Overall, configuration in Kedro is well received and liked by users. No column had a particularly negative response, and users largely understood and appreciated the approach Kedro has taken so far. During this exercise, it became clear that configuration scaling impacts a small set of use cases, summarised in the table below.

This would indicate that large configuration files are mostly seen internally, often on large analytics projects. This stems from Kedro not supporting multiple use cases in a monorepo, therefore forcing users to use configuration environments as a stop-gap solution. This, however, prevents teams from using environments for their intended purpose of separating development environments.
6. GitHub Analysis
To support the qualitative insights from user research, a custom GitHub query was created to gather quantitative data on the Data Catalog.
At the time of running (18 Aug 2021), this presented 411 results, of which 138 were real Kedro Data Catalog files. Note: empty Data Catalogs, spaceflights or iris examples, and non-Kedro projects were manually filtered out. This query assumes that these files are representative of open-source users and that Data Catalogs follow the `/conf/` folder structure. Furthermore, it's impossible to determine whether these are the complete files of finished projects or still under development.

From this, it was found that only 9% of users were using YAML anchors and only 2% were using `globals.yml`. However, 89% of users were using some type of namespacing in their catalog entries. Furthermore, the number of Data Catalog entries per file was counted; from the histogram below, Data Catalog entries peak around 10.

7. Data Catalog Generator
To better understand what users need from the Data Catalog, users were presented with possible options using prototype code. Participants were presented with two existing options for the Data Catalog and two possible solutions: pattern matching and Jinja templating (with users favouring the former of the two). Users were also asked about their feelings on moving the Data Catalog entirely to Python. Here, participants were universally against the idea, as it would fundamentally go against the principles of Kedro.
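For readers unfamiliar with the YAML-anchor option referenced in the feedback below, here is a runnable PyYAML sketch of the idea (dataset names and paths are illustrative):

```python
import yaml

# One anchored base entry, reused by multiple datasets via the merge key.
catalog_yaml = """
_csv: &csv
  type: pandas.CSVDataSet
  load_args:
    sep: ','

companies:
  <<: *csv
  filepath: data/01_raw/companies.csv

shuttles:
  <<: *csv
  filepath: data/01_raw/shuttles.csv
"""

catalog = yaml.safe_load(catalog_yaml)
print(catalog["companies"]["load_args"])  # {'sep': ','}
# By convention, entries prefixed with "_" hold anchors rather than datasets.
```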
**Current Data Catalog (one entry per dataset)**

What users like:

- Declarative syntax makes it easy to use, read and debug
- Simplification of I/O
- Decouples code and I/O
- Already has many data connectors built in
- Transcoding datasets

Pain points:

- Duplication of files
- Minor changes to entries need to be applied everywhere - can be difficult to sync
- Not easy to write a custom class for unsupported datasets
- For some teams, YAML anchors are beyond their skillset
- Very long Data Catalog files

**YAML anchors**

What users like:

- Still easy to read and debug
- Built-in YAML feature, so used in other tools that use YAML
- Still somewhat declarative

Pain points:

- Users were using it without knowing they were using it
- Getting accustomed to the notation can take a while to learn and fully understand
- Sub-keys are declared elsewhere, which impacts readability
- Concern about the order of operations

**Pattern matching**

What users like:

- Drastically reduces the number of lines
- Viewed as beginner-friendly
- Takes away the additional step of having to declare new files in the Data Catalog

Pain points:

- Doesn't work for raw datasets
- Breaks when the files have different schema definitions in the Data Catalog entries
- Concern about unintended consequences
- Doesn't solve the file duplication problem
- Same naming structure doesn't mean files have the same structure

**Jinja templating**

What users like:

- Somewhat established in the Python world, so users may have already used it elsewhere
- Reduces the number of lines, but not as much as pattern matching
- Greater control between memory and file datasets
- Access to StackOverflow to help debug issues

Pain points:

- Doesn't work for raw datasets
- User experience suggests beginners struggle to use and understand it - some teams have even removed it completely from their work
- Can overcomplicate the Data Catalog with logic
- Breaks when the files have different schema definitions in the Data Catalog entries
- Doesn't solve the file duplication problem
- Bigger learning curve compared to the previous options
- Whitespace control can be difficult to manage

**Python Data Catalog**

Pain points:

- Mixes code and I/O, which goes against Kedro principles
- Considered very unfriendly - especially for non-technical users
- Huge concerns about giving too much freedom to users who might abuse this flexibility
8. Solution Criteria
While it was important to test the ideas, it was even more important to understand the criteria of a successful solution that would improve the experience of using the Data Catalog. Therefore, users identified the following 7 components: