-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enabling case sensitive table and column names #163
Comments
@gpodevijn good question! So the main rationale is that it's easier to query identifier values in PostgreSQL which all match. The reason for this is that PostgreSQL by default ignores case when querying values. ie, the following queries match: SELECT a, b, c
FROM FoO SELECT A, B, c
FROM foo In order to query by case sensitive values, we have to enclose the above in double quotes. It's a minor barrier to working with a produced schema, but alas. The other reason for this is that this is what Stitch Data does. Per our docs here we try to match up with their resulting data store whenever possible to make use of their community as best as possible. |
@AlexanderMann thanks a lot for the explanations! I have three follow-up questions. When you're saying In order to query by case sensitive values, we have to enclose the above in double quotes -- is that correct that you're saying this from a user/data consumer perspective (i.e., there wouldn't be any implications for the target or any other tap)? Matching up what Stitch Data does makes perfect sense to me. However, if I have a use case where I'd like case sensitive table and column names in my Postgres database, would it be safe for me to fork your repo and apply the appropriate changes (the regexes in my first post)? Finally, do you think that it would make sense to add a parameter to your current implementation where we could choose between lower case (current implementation) or case sensitive identifier names (mentioning that it'd break compatibility with Stitch Data) ? |
@gpodevijn I tried to answer everything as best I could. Please feel free to follow up. I tried to note our past decisions where I could, and then also what my interpretation of the various questions was, etc. etc. 1
I think the answer here is "yes" but I'm a little confused by the question. What follows is my understanding of the question along with rationale presented in a slightly different way to see if any of it helps 😄 The intention behind the decision is to make querying for a data-analyst simpler. ie. if I as a data-scientist have to remember to Building applications on top of a Singer target is slippery in the first place, so it's often expected that dealing with further processing post-Singer will require some wrangling etc. ie, it's difficult to expect fields to come over 1-1. 2
Most likely yes. But support for that will vary depending on the amount of changes etc. You should be fine, but etc. 🤷♀ Question for you!I would be curious to hear from you what your use case here is? Typically name collisions in a schema are due to a number of conflicting things, not casing alone. I'm curious if you want to have tables named 3
Soooooooooo...no ❌ 😄 We discussed this at some point, but the problem you run into is that various folks are going to want various re-mapping etc. For you, you want case sensitivity to be the only thing differing from "what Stitch does". For someone else, they'll want to ditch the Our hope by providing a single function to overwrite for name-canonicalization was that intrepid users (such as yourself) could easily fork the repo and get the additional functionality and push the support of that particular decision onto the developer instead of a potentially hairy feature flag in our codebase. |
@AlexanderMann this is a very detailed answer, thank you very much. I understand your arguments and your decisions! So my use case is this one: Our data scientists want all table names and column names to be snake_case. We have an ETL process that loads our database tables into Redshift. When columns are CamelCase, the ETL process renames the columns in snake_case based on a set of specific rules (e.g., MyColumn becomes my_column but ARR becomes arr and not a_r_r). The rules to rename a column from CamelCase to snake_case tries to be as generic as possible (e.g., ARR is not hard coded in the rules). This is actually a set of horrible regexes 🙂 Having all our column names in lower case in postgres prevents me from using the set of regex rules, as MyColumn would become mycolumn and it would be much more difficult to rename it to my_column. I hope what I'm saying makes sense to you 😄 In conclusion, I will fork the repo and implement my changes ⌨️ 🙂 |
Godspeed @gpodevijn ...we've all been there... Out of curiousity, why are you using Additionally, would a feature such as #66 help you out? I'm assuming it'd be even more work to get your ETL-regex-monster to recognize column comments etc... |
I would like to understand the reason of forcing table and column names to be lower case.
In
postgres.py
, on lines 385 and 392, the regexes only accept lowercase identifiers (and the error messages are in line with that).What would be potential consequences of allowing case sensitive identifier names? For example, changing the regexes by:
'^[a-z_A-Z].*'
and'^[a-z0-9_$A-Z]+$'
Thank you.
Regards
The text was updated successfully, but these errors were encountered: