Hi there, Chris…. good to “see” you again!

First, thank you for the article. It takes time to write such articles as well as to write, test, and explain the demo code. Anyone that takes such time is “Aces” in my book. Thank you also for the honorable mention in your good article.

As you pointed out, the random INT generator formula in the CASE operation is regenerated for each usage making it so that the second WHEN isn’t based on the same random INT as the first WHEN. That, of course and as you explained, causes the skew that you pointed out.
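The skew is easy to reproduce with a minimal sketch, assuming only that sys.all_columns cross joined to itself returns at least 100,000 rows. The temp table name #SkewDemo and the hard-coded 70/85 cut-offs are placeholders for illustration, not code from the article.

```sql
--===== Hypothetical demo: each WHEN rolls its own random INT, so the buckets skew.
     -- #SkewDemo is a throwaway temp table used only for this illustration.
 SELECT TOP (100000)
        DataSet = CASE
                  WHEN ABS(CHECKSUM(NEWID())%100)+1 <= 70 THEN 'Training'
                  WHEN ABS(CHECKSUM(NEWID())%100)+1 <= 85 THEN 'Validation'
                  ELSE 'Testing'
                  END
   INTO #SkewDemo
   FROM      sys.all_columns ac1
  CROSS JOIN sys.all_columns ac2
;
--===== Count the rows per bucket.
 SELECT DataSet
       ,NumberOfRows = COUNT(*)
   FROM #SkewDemo
  GROUP BY DataSet
;
 DROP TABLE #SkewDemo
;
```

Because the second WHEN gets a fresh random INT, a row reaches it 30% of the time and passes it 85% of the time. The counts should therefore land near 70% Training, 25.5% Validation (0.30 x 0.85), and 4.5% Testing (0.30 x 0.15) instead of the intended 70/15/15.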

Your mathematical solution to correct the problem is great but, for UPDATEs, isn’t actually needed. We can avoid the complexity of having to realize such a formula by fixing the actual problem.

The actual problem is making it so we generate the random INT just once for every row. The really cool part is that the UPDATE statement in T-SQL has this functionality built in.

I didn’t see a link for your Appointment table so I used the following code to generate one. As usual for me, it’s a million row test table so that we can test for both functionality and performance with the same table generation code. You can reduce the number for the TOP for easy functionality testing and then use the number 1000000 (or more) for performance testing once the functionality testing has proven satisfactory.

--===== Create and populate the test table so that people don't have to
     -- go looking for the test table you used.
     -- DROP TABLE IF EXISTS dbo.Appointment --Commented out for safety
;
 SELECT TOP 1000000
        AppointmentID = ROW_NUMBER() OVER(ORDER BY (SELECT NULL))
       ,DataSet       = CONVERT(CHAR(10),'') --Presetting to blank
       ,SimOtherCols  = CONVERT(CHAR(100),'X')
   INTO dbo.Appointment
   FROM      sys.all_columns ac1
  CROSS JOIN sys.all_columns ac2
;

Shifting gears back to the problem at hand, if you check the documentation for the UPDATE statement at the following URL…

https://docs.microsoft.com/en-us/sql/t-sql/queries/update-transact-sql?view=sql-server-ver15

…you find the following is allowed in the SET clause…

@variable = expression

We can take advantage of that to “materialize” the random INT once for each and every row, which lets us keep everything as simple as your original #2 example, except that this time it works as expected. (My apologies if this site causes the code to “wrap”.)

--===== This works without complicated esoteric formulas in the CASE operator
DECLARE  @TrainingPercentage   INT = 70
        ,@ValidationPercentage INT = 15
        ,@TestingPercentage    INT = 15
        ,@Rand1To100           INT --<=====<<<< Added this.
;

--===== This works without complicated esoteric formulas in the CASE operator.
     -- And, yes… it's supported in Books Online, but the authors of Books Online make a silly claim that
     -- it's not predictable, and nothing could be further from the truth.
     -- Obviously, this isn't going to work for SELECTs. It only works for UPDATEs.
     -- @Rand1To100 is calculated for every row updated and then immediately used by the formula for DataSet.
 UPDATE dbo.Appointment
    SET @Rand1To100 = ABS(CHECKSUM(NEWID())%100)+1 --This fully materializes ONCE for EVERY row.
       ,DataSet     = CASE
                      WHEN @Rand1To100 <= @TrainingPercentage THEN 'Training'
                      WHEN @Rand1To100 <= @TrainingPercentage + @ValidationPercentage THEN 'Validation'
                      ELSE 'Testing'
                      END;

--===== Show the counts to prove it works (with varying numbers, as you pointed out)
 SELECT DataSet
       ,NumberOfRows = COUNT(*)
   FROM dbo.Appointment
  GROUP BY DataSet

;
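Raw counts on a million rows are hard to eyeball against 70/15/15, so a variation that also reports each bucket's share may help. This is only a sketch; the PercentOfRows alias is made up for illustration and assumes the dbo.Appointment table built above.

```sql
--===== Same counts, plus each bucket's share of the whole table.
     -- SUM(COUNT(*)) OVER () is the grand total of the per-group counts.
 SELECT DataSet
       ,NumberOfRows  = COUNT(*)
       ,PercentOfRows = CONVERT(DECIMAL(5,2), 100.0 * COUNT(*) / SUM(COUNT(*)) OVER ())
   FROM dbo.Appointment
  GROUP BY DataSet
;
```

With the random-per-row UPDATE, the percentages should hover close to the targets without hitting them exactly.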

Now, what about the "perfect" distribution you were able to get in your example #5? The answer is that we don't need any temp tables and we can actually pull this off in a single UPDATE. Here's the code. It uses the "trick" of using a CTE to generate the random value and then, using that random value from the CTE, we can update the CTE itself, which causes the underlying table to be updated. (I wish MS would document that feature a whole lot better!).

--===== This method still does the random assignment according to the following percentages
     -- but always creates a "perfect" distribution of the random values.
DECLARE  @TrainingPercentage   INT = 70
        ,@ValidationPercentage INT = 15
        ,@TestingPercentage    INT = 15
;

   WITH cteEnumerate AS
(--===== Assign a 1 to 100 number based on the modulus of a random row number
      -- where the row number is totally unique in a serial fashion but
      -- randomly assigned for each row.
 SELECT AppointmentID
       ,DataSet
       ,Rand1To100 = ROW_NUMBER() OVER (ORDER BY NEWID())%100+1
   FROM dbo.Appointment
)
 UPDATE cteEnumerate
    SET DataSet = CASE
                  WHEN Rand1To100 <= @TrainingPercentage THEN 'Training'
                  WHEN Rand1To100 <= @TrainingPercentage + @ValidationPercentage THEN 'Validation'
                  ELSE 'Testing'
                  END
;

--===== Show the counts to prove it works (with varying numbers, as you pointed out)
 SELECT DataSet
       ,NumberOfRows = COUNT(*)
   FROM dbo.Appointment
  GROUP BY DataSet
;

--===== Show that the results actually are random even though the percentage for
     -- each DataSet value is always perfect now.
 SELECT TOP 100 *
   FROM dbo.Appointment
  ORDER BY AppointmentID

;
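The reason the split comes out "perfect" is that ROW_NUMBER() hands out every number from 1 to the row count exactly once, so the modulus spreads the rows evenly across the values 1 to 100 no matter how NEWID() shuffles them. A read-only check of that, sketched under the assumption that dbo.Appointment exists as built above; ctePerValue and the three output aliases are hypothetical names.

```sql
--===== Count how many rows land on each of the 1 to 100 values.
   WITH cteEnumerate AS
(
 SELECT Rand1To100 = ROW_NUMBER() OVER (ORDER BY NEWID())%100+1
   FROM dbo.Appointment
)
,ctePerValue AS
(
 SELECT Rand1To100
       ,RowsPerValue = COUNT(*)
   FROM cteEnumerate
  GROUP BY Rand1To100
)
 SELECT DistinctValues  = COUNT(*)
       ,MinRowsPerValue = MIN(RowsPerValue)
       ,MaxRowsPerValue = MAX(RowsPerValue)
   FROM ctePerValue
;
```

When the row count is a multiple of 100, the minimum and maximum should match (10,000 each for a table of exactly 1,000,000 rows). For any other row count they differ by at most one.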

Thank you again for the article and keep it up!

I really wish forum software were a bit more friendly when it comes to formatting code and other things that require leading white space. Nearly 3 decades in the making and so little forum/blog software gets it right. I had even converted the leading spaces to “non-breaking spaces” in Word before I posted.

Thanks, Anandi!

Thanks, Martin!