Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve our SQL Server client-generated GUIDs #33579

Open
roji opened this issue Apr 20, 2024 · 16 comments
Open

Improve our SQL Server client-generated GUIDs #33579

roji opened this issue Apr 20, 2024 · 16 comments

Comments

@roji
Copy link
Member

roji commented Apr 20, 2024

There's a report that our current client-side sequential GUIDs generator for SQL Server doesn't work well for concurrent environments; #30753 was opened to consider switching to server-generated GUIDs instead, as a way of mitigating that.

However, @JackTrapper posted an analysis in #30753 (comment), according to which the issue could be mitigated by improving the client-side generation logic instead. If this indeed makes our client-generated GUIDs as indexing-efficient as server-side ones generated via NEWSEQUENTIALID, that would be a good improvement that should also maintain EF backwards compatibility.

@roji
Copy link
Member Author

roji commented Apr 20, 2024

@JackTrapper would you be interested in submitting a PR for this? EF's client-side GUID generator for SQL Server is in https://github.com/dotnet/efcore/blob/main/src/EFCore/ValueGeneration/SequentialGuidValueGenerator.cs.

@cremor
Copy link

cremor commented Jul 12, 2024

dotnet/runtime#103658 was completed recently and .NET 9 will contain Guid.CreateVersion7().
Will this help? So will this new method solve the problems mentioned by @JackTrapper in #30753 (comment)?

@roji
Copy link
Member Author

roji commented Jul 12, 2024

@cremor I'm not sure, this will require some careful investigation.

@roji roji added this to the Backlog milestone Jul 12, 2024
@cremor
Copy link

cremor commented Jul 12, 2024

I searched a bit more and it looks like UUID v7 (which is used by the new Guid.CreateVersion7()) is not matching the SQL Server sorting behavior.

See:
https://github.com/mareek/UUIDNext/blob/main/Src/UUIDNext/Uuid.cs#L28-L35
https://github.com/mareek/UUIDNext/blob/main/Src/UUIDNext/Generator/UuidV8SqlServerGenerator.cs
mareek/UUIDNext#2

@JackTrapper
Copy link

dotnet/runtime#103658 was completed recently and .NET 9 will contain Guid.CreateVersion7(). Will this help? So will this new method solve the problems mentioned by @JackTrapper in #30753 (comment)?

I'm certain it will not help for SQL Server because it does not tailor itself to SQL Server's sort order.

Which is appropriate; it is correct for C# to be agnostic to SQL Server, Oracle, DB2, Postgres.

No, this feature needs to be a database client feature. And it's not like one-size-fits all, because different databases use different sort rules.

No, what you need is:

public Guid CreateSQLServerSortableGuid(); // uniqueidentifier data type

public Guid CreatePostgresSortableGuid(); // uuid data type

I don't believe any other database engines have a uniqueidentifier/uuid data type; but just binary(16), char(36), or varchar(36).

So there is no such thing as a one-size-fits-all solution. You have to know you are generating a UUID to be fed to Microsoft SQL Sever. I don't know where/if in EF Core you tell it the flavor of database engine you're using-so that way it can adjust the bits of a standard type-1 uuid to make is collation compatible with the database back-end engine.

No matter what you want to start with UuidCreateSequential (or the local OS's equivalent) becuase there are rules about how to handle two UUIDs being generated within the same 100 ns tick. And it's not an issue you can ignore; modern CPUs are fast enough you can generate 200,000 UUIDs in the same 100 ns tick. The solution is the system-wide collision counter documented in the RFC, and already implemented on Windows by UuidCreateSequential.

Once you have your type-1 uuid created by UuidCreateSequential, you then have to move the bytes around so that the 64-bit (8-byte) timestamp is sorted first.

So, no, any implemenation of Guid.CreateVersion7() will not be useful here, since it was not designed with my specific favorite database engine in mind.

@osexpert
Copy link

osexpert commented Aug 1, 2024

Uuidnext has a variant of uuid v8 that is sortable in ms sql server : https://github.com/mareek/UUIDNext/blob/main/Src/UUIDNext/Generator/UuidV8SqlServerGenerator.cs

@ConnerPhillis
Copy link

@JackTrapper

I ran your code for sortable guids in (https://github.com/dotnet/efcore/issues/30753#issuecomment-2027429203)[his comment] on #30753. I found that it does not significantly differ from the earlier mentioned GUID generation strategies. The GUIDs generated used an average page space of 67%, and fragmentation was 41%. I have upgraded hardware at this point, so there could be some skew in there.

I'm wondering if maybe there's something wrong with my testing methodology now, though, as this seems really bad for something that seems so well documented and thought out.

Is there any chance that you can explain why I got such a high fragmentation %?

@JackTrapper
Copy link

JackTrapper commented Oct 10, 2024

Is there any chance that you can explain why I got such a high fragmentation %?

I'd have to say:

  • SQL Server doesn't sort UUIDs in the order i thought
  • SQL Server changed the way it sorts UUIDs to something different than i originally thought
  • there's a bug in how i shuffle around the bytes of the type-1 UUID to match SQL Server's sort order

@SubjectiveReality
Copy link

Uuidnext has a variant of uuid v8 that is sortable in ms sql server : https://github.com/mareek/UUIDNext/blob/main/Src/UUIDNext/Generator/UuidV8SqlServerGenerator.cs

does this have similar concurrency issues though

@ConnerPhillis
Copy link

At this point I'm not sure if there's any point in keeping this issue open. The problem that everyone is going to run in to is that entity framework prioritizes knowing your Guid ID in advance over having the DB generate one serially.

This behavior has been in EF for every major version (1.0 - 9.0 of .NET). If they were to try to change something about this it would probably break far more peoples' code than it would help.

Likely the only thing that needs to be changed is the documentation. This blog post says that:

By convention, non-composite primary keys of type short, int, long, or Guid are set up to have values generated for inserted entities if a value isn't provided by the application. Your database provider typically takes care of the necessary configuration; for example, a numeric primary key in SQL Server is automatically set up to be an IDENTITY column

This probably isn't the same verbiage as it was back in 2.1 when I started working with EF core, but I remember thinking that from the docs it seemed like the value was being generated SQL side.

IMO spelling out the behavior is important, and that's why I originally wrote that blog post. There's nothing that outright says "hey, your guids are being generated by the application by default" and goes in to the implications of that default behavior. You have to specifically go hunting for how to opt out of the behavior, and by that time you've probably already gone so far that its too late.

@roji
Copy link
Member Author

roji commented Mar 1, 2025

@ConnerPhillis I'll note that just below the paragraph you cited above, you can find this one:

For example, on SQL Server, when a GUID property is configured as a primary key, the provider automatically performs value generation client-side, using an algorithm to generate optimal sequential GUID values. However, specifying ValueGeneratedOnAdd on a DateTime property will have no effect (see the section below for DateTime value generation).

That's very explicit about the value generation happening client-side; I do agree that someone only reading the Primary key section could miss this, and so that these docs could be slightly improved, though.

by that time you've probably already gone so far that its too late.

How is it too late? Changing GUID value generation from client-side to server-side is trivial, and can be done at any point - the fact that some GUIDs have already been generated client-side shouldn't prevent anyone from switching to server-generated.

@ConnerPhillis
Copy link

ConnerPhillis commented Mar 3, 2025

Hey @roji ,

I'll clarify - I think ideally, the behavior should be noted for the various primary key types in a table, e.g.

Key Type Generation Strategy
int Identity column in chosen database provider
guid Generated client side, optimized for SQL server (link to documentation)
string ...

I'd argue what you found is explicit, but its hard to notice. The first part feels like it suggests guids are generated server side, and that's all I read. Yes that's on me for not reading the whole doc, but I imagine that others are doing the same as well. Might be less of a problem now that ChatGPT is a thing, tho.

To my second part about it being "to late", consider the guids that are generated by NEWSEQUENTIALID in SQL Server

744212a5-33f8-ef11-861e-00155d096d22
754212a5-33f8-ef11-861e-00155d096d22
764212a5-33f8-ef11-861e-00155d096d22
(second digit is changing)

vs the guids that are generated by SequentailGuidValueGenerator

09ff63c3-eb9f-4e92-c3df-08dd5a59a1eb
e19cb6c4-5608-4bdd-c3e0-08dd5a59a1eb
4b1bab3f-37d4-43ae-c3e1-08dd5a59a1eb
(several digits changing at beginning and fourth part)

Now - this may be me speaking out of turn here as I am not a db expert, but I don't think that SQL would know how to optimize this index if we were to swap between strategies.

Say that you started with server generated ids and wanted to swap to db generated because of consistency issues. If SQL is using the first component of the GUIDs to order here, then that means that every new GUID would have to be inserted somewhere between GUIDs that are far more random. Ordering the insert would look something like:

09ff63c3-eb9f-4e92-c3df-08dd5a59a1eb
4b1bab3f-37d4-43ae-c3e1-08dd5a59a1eb
> 744212a5-33f8-ef11-861e-00155d096d22
> 754212a5-33f8-ef11-861e-00155d096d22
> 764212a5-33f8-ef11-861e-00155d096d22
e19cb6c4-5608-4bdd-c3e0-08dd5a59a1eb

Again, I'm not a DB expert, but I would think that this would be really bad for index and page fragmentation. I could run a benchmark on this but that would probably have to wait until the weekend.

My thinking (and please correct me if I'm wrong) is that if you were to swap from client to server side (or vise versa) the index fragmentation would likely be just as bad. Essentially if you go to production using a particular guid generation strategy, you're locked in because trying to swap to the other means you'll end up with the same problem at a different severity.

@roji
Copy link
Member Author

roji commented Mar 3, 2025

The first part feels like it suggests guids are generated server side, and that's all I read.

I don't think that's quite true:

By convention, non-composite primary keys of type short, int, long, or Guid are set up to have values generated for inserted entities if a value isn't provided by the application. Your database provider typically takes care of the necessary configuration; for example, a numeric primary key in SQL Server is automatically set up to be an IDENTITY column.

The text does says "values generated" without specifying where or who generates them. I do agree this is a bit technical/pedantic, and we could make things more immediately clear though. One of the issues here is that different providers/databases can and do do different things: it's perfectly fine for one provider to generate GUIDs in the database, and another on the client. For a concrete example, relational providers usually default to generating int keys in the database ("identity"), but Cosmos does not support this.

All this to say that this is a very general part of the docs that docs not attempt to say what happens on each different database/providers; this is why it's ambiguous on exactly where the generation occurs.

Now - this may be me speaking out of turn here as I am not a db expert, but I don't think that SQL would know how to optimize this index if we were to swap between strategies.

SequentailGuidValueGenerator basically employes the same GUID generation as the SQL Server NEWSEQUENTIALID function -that's basically the reason it exists; so in principle, switching between the two shouldn't be disruptive and generate GUIDs which can be indexed well together.

@ConnerPhillis
Copy link

ConnerPhillis commented Mar 9, 2025

@roji

One of the issues here is that different providers/databases can and do do different things: it's perfectly fine for one provider to generate GUIDs in the database, and another on the client. For a concrete example, relational providers usually default to generating int keys in the database ("identity"), but Cosmos does not support this.

I think I should settle on agreeing with you and realizing that things are less standardized than I initially thought within ef

SequentailGuidValueGenerator basically employes the same GUID generation as the SQL Server NEWSEQUENTIALID function -that's basically the reason it exists; so in principle, switching between the two shouldn't be disruptive and generate GUIDs which can be indexed well together.

I don't think that I agree with this. The guids that I generated earlier in my example came from SQL server vs sequential value generator. But I wanted to run benchmarks to see if what I was saying was accurate before I said something wrong.

in the case that you go from NEWSEQUENTIALID to EF generated guids

I ran this benchmark by first inserting 100,000 guids from SQL server NEWSEQUENTIALID() and then after inserting 100,000 guid ids from SQL server's guid value generation strategy. This is sort of a "best case" scenario, both of these methods used in isolation generate a fragmentation % of around 1%. When my benchmark was done fragmentation remained at .84%, however the insert performance on the DB absolutely plummeted.

Inserting the GUIDs w/ NEWSEQUENTIALID (I feel like I'm shouting) took ~ 7 seconds in SQL server. When I started inserting the GUIds in EF core, the first batch of 1,000 took 1.6 seconds, and rose in a sort of logarithmic curve before settling at ~ 17 seconds an insert.

This leads me to believe that index fragmentation is kept very low at the cost of insert performance. SQL server is likely having a hard time coping with the two different formats for GUIDs. I think it's running in to issues balancing and having to split several pages for every insert

in the case that you go from EF generated to NEWSEQUENTIALID

A quick observation before I get too far in to this that I think is really important: I completely nuked the database (not server) and started over. The performance degradation that I encountered on insert still exists. We start at 1.6 seconds for an insert of 1,000 IDs, and degrade all the way down again. Here's a graph of my output from excel:

It somewhat alarms me that this is linear. Hopefully I'm doing something wrong?

edit: this was wrong, I forgot to clear tracked changes so EF was evaluating too much. Issue was not SQL server (it was me :) )

Anyways, once I was done inserting everything, fragmentation was at .54%, in line with what I got in my first go round. I altered the table to do the GUID generation in the DB instead and went to generating. This time, the generation took ~9 seconds, and fragmentation remained at .61%.

So, you are correct and I am wrong. Swapping from SQL to EF core GUIDs is not a risk to database performance. I am going to keep benchmarking though. It is alarming to me that the EF inserts are behaving so poorly, I want to figure out what is at the root cause of this.

@ConnerPhillis
Copy link

ConnerPhillis commented Mar 9, 2025

@roji almost my entire previous comment is wrong and can be explained by me not clearing out the change tracker on a long-lived db context object. In summary, your assumption is correct, I am incorrect.

@roji
Copy link
Member Author

roji commented Mar 10, 2025

@ConnerPhillis no worries, this stuff is hard to check and you're far from the first person to get change tracking interfering in benchmarking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants