Splitting Data For Machine Learning, Using T-SQL

It is a common practice in data science to split our data up into three separate datasets in order to perform machine learning (ML) on it. The majority of the data will be used to Train our ML models and then a portion of the remaining data will be used to Validate those models. The best of these models will then be applied to our Testing data to evaluate its performance.

Graphical ML tools such as Azure Machine Learning often provide easily configurable drag-and-drop tools to split our data in this manner, but what happens if we are working on a custom solution, perhaps using something like SQL Server’s In-Database Machine Learning? In this blog post we’ll look at a couple of different T-SQL solutions to quickly split our data randomly into these datasets.

A common approach is to use 60-80% of our data for the Training set and to divide the remainder evenly between Validation and Testing. For our purposes we’ll split the difference and dedicate 70% of our data to training, leaving 15% to Validation and 15% to Testing.

For these examples, we’ll be looking at patient appointment data from our sample data warehouse. Firstly we want to maintain a record of which rows of data were in which dataset for the purposes of reproducibility and tracking data lineage, so we’ll add a column to our table to hold that information.

ALTER TABLE fact.Appointment
ADD Dataset VARCHAR(10) NULL;

Next we’ll update that column randomly to one of Training, Validation, or Testing based on the percentages we specified above and run a quick GROUP BY query to look at the results.

--  Method #1 - Using the RAND() function
--  Definitely not the results we want!

DECLARE
    @TrainingPercentage     INT = 70,
    @ValidationPercentage   INT = 15,
    @TestingPercentage      INT = 15;


UPDATE
    fact.Appointment
SET
    Dataset =   CASE
                    WHEN 100 * RAND() < @TrainingPercentage THEN 'Training'
                    WHEN 100 * RAND() < (@TrainingPercentage + @ValidationPercentage) THEN 'Validation'
                    ELSE 'Testing'
                END;


SELECT
    Dataset,
    NumberOfRows = COUNT(*)
FROM
    fact.Appointment
GROUP BY
    Dataset
ORDER BY
    Dataset;

We see right away that this method failed horribly, as all of the data was placed into the same dataset. This holds true no matter how many times we execute the code, and it happens because the RAND() function is only evaluated once for the whole query, and not individually for each row. To correct this we’ll instead use a method that Jeff Moden taught me at a SQL Saturday in Detroit several years ago – generating a NEWID() for each row, using the CHECKSUM() function to turn it into a random integer, wrapping that in ABS() to make it non-negative, and then applying the % (modulo) operator to reduce it to a number between 0 and 99 inclusive.
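
As a quick aside, if you’d like to see that once-per-query behaviour for yourself, a throwaway query like the one below – using sys.objects purely as a convenient row source – makes the contrast with the per-row NEWID() approach obvious.

--  A quick side experiment: RAND() is evaluated once for the whole query,
--  while ABS(CHECKSUM(NEWID())) % 100 is evaluated separately for every row.

SELECT TOP (5)
    SameForEveryRow         = RAND(),
    DifferentForEveryRow    = ABS(CHECKSUM(NEWID())) % 100
FROM
    sys.objects;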

-- Method #2 - Using the NEWID() function
-- Getting closer, but still not quite there

DECLARE
    @TrainingPercentage     INT = 70,
    @ValidationPercentage   INT = 15,
    @TestingPercentage      INT = 15;


UPDATE
    fact.Appointment
SET
    Dataset =   CASE
                    WHEN ABS(CHECKSUM(NEWID())) % 100 < @TrainingPercentage THEN 'Training'
                    WHEN ABS(CHECKSUM(NEWID())) % 100 < (@TrainingPercentage + @ValidationPercentage) THEN 'Validation'
                    ELSE 'Testing'
                END;


SELECT
    Dataset,
    NumberOfRows = COUNT(*)
FROM
    fact.Appointment
GROUP BY
    Dataset
ORDER BY
    Dataset;

So this method failed too. Although we are getting the expected number of Training rows, the split between Validation and Testing is way off. What we find is that the NEWID() generated in the second WHEN clause of the CASE expression is not the same value as in the first, so the 30% of non-Training data is being split roughly 85/15 (about 25.5% Validation and 4.5% Testing overall) instead of the expected 50/50. Let’s tweak the logic in that second WHEN clause to fix the error.

-- Method #3 - Fixing the logical error
-- This works, but the size of the datasets varies slightly each run

DECLARE
    @TrainingPercentage     INT = 70,
    @ValidationPercentage   INT = 15,
    @TestingPercentage      INT = 15;


UPDATE
    fact.Appointment
SET
    Dataset =   CASE
                    WHEN ABS(CHECKSUM(NEWID())) % 100 < @TrainingPercentage THEN 'Training'
                    WHEN ABS(CHECKSUM(NEWID())) % 100 < (@ValidationPercentage * 100 / (@ValidationPercentage + @TestingPercentage)) THEN 'Validation'
                    ELSE 'Testing'
                END;


SELECT
    Dataset,
    NumberOfRows = COUNT(*)
FROM
    fact.Appointment
GROUP BY
    Dataset
ORDER BY
    Dataset;

And success! The only thing we may notice is that if we execute this code multiple times, we get a slightly different number of rows in each dataset.

What if we have a requirement to have exactly the desired percentage of rows in each dataset, and to get that same exact percentage every time we execute? In this situation, we are going to generate a temp table that distributes the exact number of each dataset value randomly across all the rows, and then use that temp table to update the original table.

-- Method #4 - Consistent and exact

DECLARE
    @RowCount               INT,
    @TrainingPercentage     INT = 70,
    @TrainingCount          INT,
    @ValidationPercentage   INT = 15,
    @ValidationCount        INT,
    @TestingPercentage      INT = 15,
    @TestingCount           INT;


SELECT
    @RowCount = COUNT(*)
FROM
    fact.Appointment;


SET @TrainingCount = FLOOR((@RowCount * @TrainingPercentage / 100.00) + 0.5);
SET @ValidationCount = FLOOR((@RowCount * @ValidationPercentage / 100.00) + 0.5);
SET @TestingCount = @RowCount - (@TrainingCount + @ValidationCount);

SELECT
    TrainingCount = @TrainingCount,
    ValidationCount = @ValidationCount,
    TestingCount = @TestingCount;


IF OBJECT_ID('tempdb..#Dataset') IS NOT NULL
    DROP TABLE #Dataset;

CREATE TABLE
    #Dataset
(
    AppointmentKey      INT PRIMARY KEY,
    Dataset             VARCHAR(10) NULL
);


WITH SourceRows AS
(
    SELECT
        AppointmentKey,
        RowNumber = ROW_NUMBER() OVER (ORDER BY CHECKSUM(NEWID()))
    FROM
        fact.Appointment
)
INSERT INTO
    #Dataset
SELECT
    AppointmentKey,
    Dataset =   CASE
                    WHEN RowNumber <= @TrainingCount THEN 'Training'
                    WHEN RowNumber <= (@TrainingCount + @ValidationCount) THEN 'Validation'
                    ELSE 'Testing'
                END
FROM
    SourceRows;


UPDATE
    Appt
SET
    Dataset = Dt.Dataset
FROM
    fact.Appointment Appt
        INNER JOIN
    #Dataset Dt
        ON Appt.AppointmentKey = Dt.AppointmentKey


SELECT
    Dataset,
    NumberOfRows = COUNT(*)
FROM
    fact.Appointment
GROUP BY
    Dataset
ORDER BY
    Dataset;


SELECT TOP 25
    AppointmentKey,
    Dataset
FROM
    fact.Appointment
ORDER BY
    AppointmentKey;

Now we see that we divide the rows in exactly the same proportions with each execution, and we’ve also included a SELECT TOP 25 so we can verify that each execution assigns different datasets to each row.

It’s possible that due to the JOIN this method might not be the best performing over extremely large datasets, so we also developed a brute-force method of splitting that performs three sequential updates that we could use instead in that specific circumstance.

-- Method #5 - The brute-force sequential method

DECLARE
    @RowCount               INT,
    @TrainingPercentage     INT = 70,
    @TrainingCount          INT,
    @ValidationPercentage   INT = 15,
    @ValidationCount INT,
    @TestingPercentage INT = 15,
    @TestingCount INT;


SELECT
    @RowCount = COUNT(*)
FROM
    fact.Appointment;


SET @TrainingCount = FLOOR((@RowCount * @TrainingPercentage / 100.00) + 0.5);
SET @ValidationCount = FLOOR((@RowCount * @ValidationPercentage / 100.00) + 0.5);
SET @TestingCount = @RowCount - (@TrainingCount + @ValidationCount);

SELECT
    TrainingCount = @TrainingCount,
    ValidationCount = @ValidationCount,
    TestingCount = @TestingCount;

UPDATE
    fact.Appointment
SET
    Dataset = NULL;


UPDATE
    Appt
SET
    Dataset = 'Training'
FROM
    (
        SELECT TOP (@TrainingCount)
            *
        FROM
            fact.Appointment
        ORDER BY
            CHECKSUM(NEWID())
    ) Appt;


UPDATE
    Appt
SET
    Dataset = 'Validation'
FROM
    (
        SELECT TOP (@ValidationCount)
            *
        FROM
            fact.Appointment
        WHERE
            Dataset IS NULL
        ORDER BY
            CHECKSUM(NEWID())
    ) Appt;


UPDATE
    fact.Appointment
SET
    Dataset = 'Testing'
WHERE
    Dataset IS NULL;


SELECT
    Dataset,
    NumberOfRows = COUNT(*)
FROM
    fact.Appointment
GROUP BY
    Dataset
ORDER BY
    Dataset;


SELECT TOP 25
    AppointmentKey,
    Dataset
FROM
    fact.Appointment
ORDER BY
    AppointmentKey;

Hopefully you’ve enjoyed this look at randomly splitting our data in T-SQL. Please let me know in the comments what other methods you’ve used to tackle this problem!

Getting Started With Azure Database for MySQL

I thought I’d share some bits and pieces about getting started with Azure Database for MySQL, the Azure platform-as-a-service (PaaS) offering for the MySQL relational database engine. Initially I started looking for a good sample database to use for demonstration purposes, but unfortunately either my search engine skills or my patience were lacking. However, that gave me a good opportunity to set up a new demo database from scratch!

I started out by setting up a dedicated resource group to use for my instance, and then used the Azure Portal GUI to create a new instance named mysql-20200505. I made sure to downgrade from the default General Purpose configuration to Basic, so it will only cost me about $67 a month if I leave it running instead of around $350. After the instance was created successfully I then added some connection security rules to ensure that only my IP was able to connect to it.

I then opened up MySQL Workbench to connect to the server as pictured below. Of course it took me two tries to connect as I made my usual error of not including the machine name in the username field the first time around.

I quickly created a new database:

CREATE DATABASE demo;

I then queried the system catalog to make sure that the database had been created and made available successfully.
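
One way to do that is a quick query against the information_schema catalog (SHOW DATABASES works just as well); something along these lines will confirm that demo now shows up:

--  One possible catalog check: list the databases on the server
--  and confirm that demo is among them.

SELECT
    schema_name
FROM
    information_schema.schemata
ORDER BY
    schema_name;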

The first table I decided to build was a tally table – a table of sequential numbers. I placed this in a new schema named admin, and I’ll use it to populate some of the other tables I’ll be creating in the future. I used a set-based method to generate this data by cross-joining one of the system catalog tables to itself. There are about 300 records in the catalog table in a brand new database, and so one cross join gives me 300 x 300, or 90,000 rows, which is plenty more than the 10,000 rows I want.

The only gotcha here was that I’d installed an older version of MySQL from before it implemented the ROW_NUMBER() windowing function. Luckily my search engine skills had improved and I found a solution quite quickly using a user variable as below:

USE demo;

CREATE SCHEMA admin;

CREATE TABLE
    admin.tally
(
    SequenceNumber SMALLINT NOT NULL PRIMARY KEY
);


SET @row_number = 0;

INSERT INTO
    admin.tally
(
    SequenceNumber
)
SELECT
    (@row_number := @row_number + 1) AS SequenceNumber
FROM
    information_schema.tables tbl1
CROSS JOIN
    information_schema.tables tbl2
LIMIT
    10000;
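
For reference, on MySQL 8.0 and later – where ROW_NUMBER() is available – a window-function version of the same insert might look something like the sketch below (untested on the older instance used above):

--  A window-function alternative for MySQL 8.0 and later.
--  Same cross join trick, but ROW_NUMBER() replaces the user variable.

INSERT INTO
    admin.tally
(
    SequenceNumber
)
SELECT
    ROW_NUMBER() OVER (ORDER BY tbl1.table_name, tbl2.table_name) AS SequenceNumber
FROM
    information_schema.tables tbl1
CROSS JOIN
    information_schema.tables tbl2
ORDER BY
    SequenceNumber
LIMIT
    10000;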

And now my tally table is created, populated, and ready to use in the next blog post in which I’ll begin to create some dummy data warehouse tables for my demo database.

Thanks for following along, and please let me know in the comments below if there are any particular areas of MySQL you’d like me to dive into deeper!

How I Passed The AZ-900 Azure Fundamentals Exam

Recently I passed the AZ-900 Azure Fundamentals exam from Microsoft. We could have a long conversation about the pros and cons of IT certifications in general, but for me wanting to pass this exam was great motivation to learn about a number of the Azure service offerings that I haven’t yet needed to use in a professional capacity.

The exam itself is a very broad, mostly non-technical, survey of cloud computing concepts and the various Azure services that are available. For example, you need to know that IoT Hub is the Azure service that you would use to manage large numbers of Internet-of-Things (IoT) devices, but you don’t need to know how to actually configure and use it.

Resources: Since I posted about passing the exam, several people have reached out to me asking what materials I used to prepare, so I’ve listed them here:

• Microsoft Azure Fundamentals Exam Ref AZ-900 book
• The official AZ-900 practice test from MeasureUp
• Tim Warner’s Azure Fundamentals Study Guide video series on YouTube
• The Azure fundamentals learning path on Microsoft Learn

Microsoft Azure Fundamentals Exam Ref AZ-900 by Jim Cheshire is the official study guide for the test. It was my number one resource, and was a bargain at only $26 from Amazon. If you read it cover-to-cover you will be well on your way to passing the exam! The official practice test from MeasureUp was hugely important too, and I may write a follow-up post to dive into more detail about how I used it effectively. It costs $120, or is discounted to $80 if you purchase it at the same time you register for the exam.

Tim Warner’s [b|t] video series on YouTube is particularly good, and as a bonus, it’s free! He is creating a separate video for each of the nearly 50 skill entries on the exam guide, and as of this writing has completed nearly half of them. Each one is a 10-15 minute treatment of what you need to know about that skill and usually includes a short demo. Highly recommended!

The Azure fundamentals learning path on Microsoft Learn is another very good free resource. It takes about 6-10 hours to complete and covers all of the objectives on the exam, many with practical exercises. Unfortunately I didn’t make completing this a high enough priority and ran out of time to finish it before the exam!

Key Concepts: Here are some of the concepts and services that either came up multiple times or that tripped me up on the practice test. Not all of these appeared on my actual exam, but that was only 34 questions drawn from a much larger pool. I’ve grouped these based on the published exam objectives:

Understand cloud concepts
• IaaS vs PaaS vs SaaS
• Public vs private vs hybrid cloud models
• CapEx vs OpEx

Understand core Azure services
• Geographies, Regions, Availability Zones, Availability Sets
• Azure Load Balancer vs Azure Application Gateway vs VPN Gateway
• IoT Hub vs IoT Central
• Machine Learning Service vs Machine Learning Studio
• Azure Functions vs Azure Logic Apps
• Azure CloudShell vs PowerShell vs CLI

Understand security, privacy, compliance, and trust
• AIP vs ATP

Understand Azure pricing and support
• Developer vs Standard vs Pro Direct vs Premier Support Plans

Effort: I put around 20 hours of work into passing the exam. I’m fortunate to be able to read very quickly and to be a good test-taker, so your mileage may vary. But with a modicum of effort most existing or aspiring IT professionals should be able to pass this exam quite easily.

I hope this post helps you get started on your way to passing the AZ-900 Azure Fundamentals exam!

SQL Train 2019

I’ve just returned home from PASS Summit 2019 and am taking a little bit of time to reflect on all I learned and to celebrate all of the great friends and colleagues that I connected with. Of course, my personal highlight was the SQL Train, on which 63 other SQL Server nerds and I traveled from Portland, OR to Seattle, WA.  This is an annual event for those traveling from SQL Saturday Oregon to Summit, and this was the fourth year in which I organized it.

Adrian Mee [B|T] was kind enough to share a couple of photos that he took of the event. Here we are boarding the two exclusive-occupancy train cars:

And here is a shot of how things looked once we were aboard:

No, I have no idea what had Neil Hambly [B|T] so freaked out either!

A pleasant surprise for this year was that even after all of the traditional SQL Train doughnuts, beer, and wine had been purchased, there was still money left over. Because of this I was very glad to be able to donate this surplus to Andy Mallon’s [B|T] #SQLFamily fundraiser for The Trevor Project on behalf of all those who rode this year’s SQL Train!

In the interest of full transparency I have included a screenshot of the financial details below:

Sadly, this will be the last SQL Train for at least two years. Next year’s PASS Summit will be held in Houston, TX, and it doesn’t look good for train routes from a nearby SQL Saturday. I’m sure that we’ll figure out an appropriately fun replacement though!

T-SQL Tuesday #108: Non-SQL Server Technologies

This month’s T-SQL Tuesday is hosted by Mala Mahadevan [b|t], who has invited people to discuss new skills and technologies that they’d like to learn in the near future, with the twist that she’s particularly interested in non-SQL Server items this time out.

This ties in rather well with my post yesterday about my New (SQL) Year’s resolutions, one of which is to spend at least thirty minutes five times a week on professional development.  I’ve got two main non-SQL areas that I’m looking to learn more about between now and next year’s PASS Summit: Microsoft Azure and Python.

Although none of my clients are currently primed to move into the Azure cloud, I need to make sure that when they’re ready, I am too!  I’m looking to dive in to the business intelligence and analytics offerings, as well as being particularly intrigued by Azure Cosmos DB.  The first goal is to truly get my head around as many of the 100+ available services as possible so that I can understand why and when to use each one; the how portion can come later.

On the Python side I’ve got three main goals:

1.  Upgrade my pandas skills to be truly “pro-level”.   No more googling every other data frame manipulation I need to write!

2.  Learn Seaborn to take my plotting and visualization skills to the next level.

3.  Work on some fun areas of Python outside of the data science ecosystem to develop my overall Python skills.

I hope you’ll check back in on me over the next twelve months to find out how I’m doing!

Happy New Year!

For many of us in the Microsoft Data Platform community, PASS Summit is a bit like Christmas.  It’s a chance to get together with our SQL Family, make new friends, and start playing with all of the new toys that Microsoft has gifted to us.

By the same token, the first day back at work after Summit is akin to New Year’s Day.  We are hungover – often literally, but almost always metaphorically too – from the high of the previous week.  But we are also energized by the sense of community and shared excitement that we’ve just experienced.  It’s a great opportunity to turn that feeling into a new start!

So I am taking the opportunity to declare this day as the first day of my SQL year, and I’ve got a couple of New Year’s Resolutions to share:

1.  I resolve to spend at least one hour five days a week on physical exercise.  I spent all of this year’s Summit under the weather.  A sizable (pun intended) component of this is the weight that I’ve put back on in 2018 after losing fifty pounds in 2016 and 2017.  This has had a noticeable effect on my breathing, sleep, and general respiratory health, and it’s time to reverse the trend.

2.  I resolve to spend at least thirty minutes five days a week on professional development.  There’s so much cool technology to explore in the data platform space, and I find that as a consultant I am often too busy implementing techniques that I’ve done 1000 times before to learn new ones.  This will also include trying to produce more technical blog posts, as well as creating new content for presentations.  Speaking of which…

3.  I resolve to not submit any abstracts for new sessions or pre-cons until they are at least 75% developed.  I’ve caused myself a lot of unnecessary stress over the past couple of years by procrastinating about finishing my presentations.  I must do better on this front!

4.  I resolve to document the previous week’s activities on my blog.  Hopefully this will help me to keep myself accountable for sticking to the first three resolutions.

I invite you all to make your “New SQL Year’s Resolutions” with me, and look forward to toasting to our successes together at next year’s PASS Summit!

Introduction To sp_execute_external_script With Python, Part 1

I’ve been speaking on integrating Python and SQL Server for a while now, and having given the same presentation at nearly a dozen SQL Saturdays, PASS Summit, and multiple user groups, it’s about time to retire my favourite session.  Just in case there’s anyone out there who is interested in the subject but hasn’t managed to see it yet, I thought I’d start sharing some of the code and use-cases in a series of blog posts.

We’ll start off with a simple step-by-step introduction to the sp_execute_external_script stored procedure.  This is the glue that enables us to integrate SQL Server with the Python engine by sending the output of a T-SQL query over to Python and getting a result set back.  For example, we could develop a stored procedure to be used as a data set in an SSRS report that returns statistical data produced by a Python library such as SciPy.  In this post we’ll introduce the procedure and how to pass a simple data set into and back out of Python, and we’ll get into manipulating that data in a future post.

If you’d like to follow along with me, you’ll need to make sure that you’re running SQL Server 2017 or later, and that Machine Learning Services has been installed with the Python option checked.  Don’t worry if you installed both R and Python, as they play quite nicely together.  The SQL Server Launchpad service should be running, and you should make sure that the ability to execute the sp_execute_external_script procedure is enabled by running the following code:

--  Introduction To sp_execute_external_script With Python - Query 1.00
--  Enabling sp_execute_external_script.

EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE;
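
If you want to confirm that the change took effect, a quick query against sys.configurations looks something like this:

--  Optional sanity check: value_in_use should now be 1.

SELECT
    name,
    value_in_use
FROM
    sys.configurations
WHERE
    name = N'external scripts enabled';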

As with any introduction to a new language, we’re contractually obligated to kick things off with a “Hello, world!”-style script.  You’ll note that our first procedure call below has only two parameters.  The @language parameter says that we’ll be using Python code as opposed to R, and that the Launchpad service should start instantiating the Python engine.   The @script parameter then contains the Python code that we’ll be executing, which in this case is a simple print statement:

--  Introduction To sp_execute_external_script With Python - Query 1.01
--  Calling the procedure with no input or output.

EXEC sp_execute_external_script
    @language       = N'Python',
    @script         = N'print("Welcome to the wonderful world of Python integration!")';

Now that we know our Python integration is working, the next step is to send some data over from SQL Server and get it back out.  We specify a T-SQL query to run in the @input_data_1 parameter; the SQL engine will execute this query and send the resulting data set over to Python.  The @script parameter has now been updated to simply say that we will take that data, identified by its default name of InputDataSet, copy it over to OutputDataSet (another default) and then return it to SQL Server.  I’ve got a tally table (a table of sequential numbers) set up in my Admin database, and I’ll begin by just sending across the first ten numbers from that table:

--  Introduction To sp_execute_external_script With Python - Query 1.02
--  Passing the top ten numbers from our tally table into Python
--  and getting them right back out.

USE Admin;
GO

EXEC sp_execute_external_script
    @language       = N'Python',
    @script         = N'OutputDataSet = InputDataSet',
    @input_data_1   = N'SELECT TOP 10 * FROM dbo.Tally;'
WITH RESULT SETS
((
    NewColumnName   INT NOT NULL
));

If for some reason you don’t have or are not able to add a tally table to your system, you can use the ROW_NUMBER() function against the sys.objects DMO instead:

--  Introduction To sp_execute_external_script With Python - Query 1.02b
--  Using sys.objects instead of a tally table.

EXEC sp_execute_external_script
    @language       = N'Python',
    @script         = N'OutputDataSet = InputDataSet',
    @input_data_1   = N'SELECT TOP 10 ROW_NUMBER() OVER (ORDER BY name) FROM sys.objects;'
WITH RESULT SETS
((
    NewColumnName   INT NOT NULL
));

There’s no need to keep the default name of “InputDataSet” for our input data set.  I’m quite partial to “myQuery” myself.  Renaming it is as simple as adding another parameter called @input_data_1_name.  Of course, we have to remember to update the Python code in @script to reflect the new name:

--  Introduction To sp_execute_external_script With Python - Query 1.03
--  Using a custom name for the input data.

USE Admin;
GO

EXEC sp_execute_external_script
    @language           = N'Python',
    @script             = N'OutputDataSet = myQuery',
    @input_data_1       = N'SELECT TOP 10 * FROM dbo.Tally;',
    @input_data_1_name  = N'myQuery'
WITH RESULT SETS
((
    NewColumnName    INT NOT NULL
));

Unsurprisingly, we can also change the name of our output data set by adding the @output_data_1_name parameter, and updating @script accordingly:

--  Introduction To sp_execute_external_script With Python - Query 1.04
--  Using custom names for both the input and output data.

USE Admin;
GO

EXEC sp_execute_external_script
    @language           = N'Python',
    @script             = N'Results = myQuery',
    @input_data_1       = N'SELECT TOP 10 * FROM dbo.Tally;',
    @input_data_1_name  = N'myQuery',
    @output_data_1_name = N'Results'
WITH RESULT SETS
((
    NewColumnName    INT NOT NULL
));

I should pause here to point out that although having parameters named @input_data_1, @input_data_1_name, and @output_data_1_name may imply that we can manage multiple input and output data sets with sp_execute_external_script, that is sadly not the case.  I’m hoping that this naming convention was chosen to allow room for that to happen in a future release (hint, hint, product team!)

Although this input query is very simple, in the real world we might be sending over the output of some very complex queries.  The input query in one of the use cases in the session I’ve been presenting is over 150 lines!  To make this more manageable in our integration, rather than embedding the whole query in our procedure call and making an unholy mess, we’ll instead store it in a variable called @InputQuery and pass that in to @input_data_1.  This allows us to use our standard T-SQL coding style, and has the added effect of being easy to copy and paste into a new query window for development and/or, heaven forbid, debugging.

--  Introduction To sp_execute_external_script With Python - Query 1.05
--  Variablizing the input data query.

USE Admin;
GO

DECLARE @InputQuery     NVARCHAR(500);
SET @InputQuery = N'
SELECT TOP 10
    SequenceNumber
FROM
    dbo.Tally;
';

EXEC sp_execute_external_script
    @language           = N'Python',
    @script             = N'Results = myQuery',
    @input_data_1       = @InputQuery,
    @input_data_1_name  = N'myQuery',
    @output_data_1_name = N'Results'
WITH RESULT SETS
((
    NewColumnName    INT NOT NULL
));

(Yes, “variablizing” is a real word.  I checked.)

We can do the same thing with our Python code, which in this example we’ve stored in a variable named @PyScript and then passed in to the @script parameter:

--  Introduction To sp_execute_external_script With Python - Query 1.06
--  Variablizing both the input data query and the Python script.

USE Admin;
GO

DECLARE
    @InputQuery     NVARCHAR(500),
    @PyScript       NVARCHAR(500);

SET @InputQuery = N'
SELECT TOP 10
    SequenceNumber
FROM
    Admin.dbo.Tally;
';

SET @PyScript = N'
Results = myQuery
';

EXEC sp_execute_external_script
    @language           = N'Python',
    @script             = @PyScript,
    @input_data_1       = @InputQuery,
    @input_data_1_name  = N'myQuery',
    @output_data_1_name = N'Results'
WITH RESULT SETS
((
    NewColumnName    INT NOT NULL
));

Look at that nice clean procedure call!  Marvel at how the business logic in our T-SQL query and our Python code is separated out, formatted nicely, and ready to copy into another window or IDE!  Can we all now agree to use this format for our sp_execute_external_script calls going forward?

This concludes part one, but we’ll return with part two very soon, in which we start to manipulate our data set in Python and see what happens when things don’t go so well…

Kentuckiana Mini-Tour

Having just finished leading the efforts on the very successful SQL Saturday Albuquerque, by all rights I should be taking this week off to recuperate.  Instead I’m starting a whirlwind “mini-tour” of Kentuckiana tomorrow, with three stops at which I’ll be presenting on SQL Server’s Machine Learning Services components.

Tuesday night I’ll be at Blue Stallion Brewing in Lexington, KY talking to the Lexington PASS group about how to leverage Python for statistical analysis in SQL Server 2017.  I’m looking forward to the unique experience of presenting with a pint glass in my hand rather than a bottle of water.  As long as I don’t try to use the pint glass like a laser pointer, I’ll be fine!  Register Here.

I don’t have any speaking engagements on Wednesday night, but will instead be heading up to Columbus, OH with my friend Matt Gordon [b|t] (who founded the Lexington PASS group) and his son to watch the Columbus Crew take on the Philadelphia Union.  As many of you know, I’m a devout soccer (real football) fan, and I’m really looking forward to visiting one of the most storied environments in US soccer.

Thursday is a double-header starting in Evansville, IN at 11:00 AM and Louisville, KY at 6:00 PM.  With a two-hour drive between them, not to mention the one-hour time difference, this will be an intense day.  It’s a good thing then that the session I’ll be presenting is the same for both groups!  Like the Lexington session I’ll be talking about statistical analysis, but this time I’ll be focusing on the R programming language, so the talk will be directly applicable to SQL 2016 as well as 2017.  Register Here for Evansville, IN.  Register Here for Louisville, KY.

If you live close to any of these three presentations, I’d love for you to come out so I can meet you and talk data and statistics with you!

Fun With Palindromes – Part 1

I was thinking about algorithms yesterday afternoon – yes, I am a strange one – and one thought led to another and soon I found myself contemplating palindromes – words or phrases that read the same forwards and backwards, such as “racecar” or Napoleon’s famous – albeit apocryphal – “Able was I ere I saw Elba!”  Specifically, I was wondering what would be the most efficient way to test a string’s palindromicity using T-SQL.

Immediately I realized that this algorithm will need to accomplish two different things.  I first need to remove all non-alphabetic characters from the string I am testing, because while “able was I ere I saw Elba” is palindromic even leaving the spaces intact, this will not work for other well-known palindromes such as “A man, a plan, a canal, Panama!”  Then the second task is to check that the remaining string is the same front-to-back as it is back-to-front.
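
That second check is the easy part – once we have a letters-only string, comparing it to its own REVERSE() will do the job, assuming a case-insensitive collation so that “A” and “a” are treated as equal. A minimal sketch might look like this:

--  A minimal sketch of the final palindromicity check, assuming the string
--  has already been reduced to letters only and the collation is
--  case-insensitive.

IF @Phrase_LettersOnly = REVERSE(@Phrase_LettersOnly)
    PRINT 'We have a palindrome!';
ELSE
    PRINT 'Not a palindrome.';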

With the help of Elder’s Dead Roots Stirring album I set out to find the most efficient T-SQL code to accomplish this task.  My plan was to avoid resorting to Google for the answer, but perhaps in a future post I will go back and compare my solution to the best one I can find online.  For this first post in the series I will tackle only the first task of removing the non-alphabetic characters from the string.

I’ll readily admit to not being well versed in the use of regular expressions, so in order to meet my Google-free objective I’ll have to find another solution for now.  Let’s start by seeing what we can do with the famed “A man, a plan, a canal, Panama!”  Because I know exactly which characters are in this string, I can use the brute-force method of nested REPLACE() functions to parse them out, as shown in the query below.

--  Fun With Palindromes Query 1.01
--  This is the brute-force method of character removal.
--  Use this method only if you have a limited number of distinct,
--  known characters to remove
 
DECLARE
    @Phrase                 VARCHAR(50),
    @Phrase_LettersOnly     VARCHAR(50);
 
SET @Phrase = 'A man, a plan, a canal, Panama!';
 
SET @Phrase_LettersOnly = REPLACE(REPLACE(REPLACE(@Phrase, ' ', ''), ',', ''), '!', '');
 
PRINT @Phrase_LettersOnly;

Of course I may not always have the luxury of knowing the contents of the string beforehand.  I do know that the characters I want to keep will fall within the range of A-Z (ASCII values 65-90) or a-z (ASCII values 97-122).  My plan is to step through the string one character at a time, but instead of using a WHILE loop, I’ll use the tally table (a table of sequential numbers) in my Admin database like so:

--  Fun With Palindromes Query 1.02
--  Using a tally table to separate out the characters in the string
--  and keeping only those that fall in the desired ASCII range.

USE Admin;
GO

DECLARE
    @Phrase                 VARCHAR(50),
    @Phrase_LettersOnly     VARCHAR(50);

SET @Phrase = 'A man, a plan, a canal, Panama!';


SELECT
    SUBSTRING(@Phrase, Tl.SequenceNumber, 1)
FROM
    dbo.Tally Tl
WHERE
    Tl.SequenceNumber <= LEN(@Phrase)
    AND
    (ASCII(SUBSTRING(@Phrase, Tl.SequenceNumber, 1)) BETWEEN 65 AND 90
        OR
        ASCII(SUBSTRING(@Phrase, Tl.SequenceNumber, 1)) BETWEEN 97 AND 122)
ORDER BY
    Tl.SequenceNumber;

If for some reason you don’t have or are not able to add a tally table to your system, you can use the ROW_NUMBER() function against the sys.objects DMO instead:

--  Fun With Palindromes Query 1.03
--  Using ROW_NUMBER() against the sys.objects DMO.

SELECT
    SequenceNumber = ROW_NUMBER() OVER (ORDER BY name)
FROM
    sys.objects;

You can put this code into a common table expression and combine it with the second query:

--  Fun With Palindromes Query 1.04
--  Using ROW_NUMBER() against the sys.objects in a CTE instead
--  of a physical tally table.

DECLARE
    @Phrase                 VARCHAR(50),
    @Phrase_LettersOnly     VARCHAR(50);

SET @Phrase = 'A man, a plan, a canal, Panama!';


WITH Tally AS
(
    SELECT
        SequenceNumber = ROW_NUMBER() OVER (ORDER BY name)
    FROM
        sys.objects
)
SELECT
    SUBSTRING(@Phrase, Tl.SequenceNumber, 1)
FROM
    Tally Tl
WHERE
    Tl.SequenceNumber <= LEN(@Phrase)
    AND
    (ASCII(SUBSTRING(@Phrase, Tl.SequenceNumber, 1)) BETWEEN 65 AND 90
        OR
        ASCII(SUBSTRING(@Phrase, Tl.SequenceNumber, 1)) BETWEEN 97 AND 122)
ORDER BY
    Tl.SequenceNumber;

Of course, separate rows for each character don’t help us very much.  Let’s use the CONCAT() function to put them back into a single string:

--  Fun With Palindromes Query 1.05
--  Using CONCAT() to put the letters-only string back together.

USE Admin;
GO

DECLARE
    @Phrase                 VARCHAR(50),
    @Phrase_LettersOnly     VARCHAR(50);

SET @Phrase = 'A man, a plan, a canal, Panama!';


SELECT
    @Phrase_LettersOnly = CONCAT(@Phrase_LettersOnly, SUBSTRING(@Phrase, Tl.SequenceNumber, 1))
FROM
    dbo.Tally Tl
WHERE
    Tl.SequenceNumber <= LEN(@Phrase)
    AND
    (ASCII(SUBSTRING(@Phrase, Tl.SequenceNumber, 1)) BETWEEN 65 AND 90
        OR
        ASCII(SUBSTRING(@Phrase, Tl.SequenceNumber, 1)) BETWEEN 97 AND 122)
ORDER BY
    Tl.SequenceNumber;


PRINT @Phrase_LettersOnly;

This method works great to parse a single string, but I will have a whole table of potential palindromes to parse.  Historically, I would usually use the STUFF() function with FOR XML PATH to reassemble the strings like this:

--  Fun With Palindromes Query 1.06
--  Using STUFF() and FOR XML PATH to put the letters-only string
--  back together.

USE Admin;
GO

DECLARE
    @Phrase                 VARCHAR(50),
    @Phrase_LettersOnly     VARCHAR(50);

SET @Phrase = 'A man, a plan, a canal, Panama!';


SELECT
	@Phrase_LettersOnly = STUFF((
                                SELECT
                                    SUBSTRING(@Phrase, Tl.SequenceNumber, 1)
                                FROM
                                    dbo.Tally Tl
                                WHERE
                                    Tl.SequenceNumber <= LEN(@Phrase)
                                    AND
                                    (ASCII(SUBSTRING(@Phrase, Tl.SequenceNumber, 1)) BETWEEN 65 AND 90
                                        OR
                                        ASCII(SUBSTRING(@Phrase, Tl.SequenceNumber, 1)) BETWEEN 97 AND 122)
                                ORDER BY
                                    Tl.SequenceNumber
                                FOR XML PATH('')), 1, 0, '');


PRINT @Phrase_LettersOnly;

I’ve always found using this method to be a bit non-intuitive and fiddly.  Thankfully SQL Server 2017 has added a new function called STRING_AGG() that we can use instead:

--  Fun With Palindromes Query 1.07
--  Using STRING_AGG() to put the letters-only string back together.

USE Admin;
GO

DECLARE
    @Phrase                 VARCHAR(50),
    @Phrase_LettersOnly     VARCHAR(50);

SET @Phrase = 'A man, a plan, a canal, Panama!';


SELECT
    @Phrase_LettersOnly = STRING_AGG(SUBSTRING(@Phrase, Tl.SequenceNumber, 1), '') WITHIN GROUP (ORDER BY Tl.SequenceNumber)
FROM
    dbo.Tally Tl
WHERE
    Tl.SequenceNumber <= LEN(@Phrase)
    AND
    (ASCII(SUBSTRING(@Phrase, Tl.SequenceNumber, 1)) BETWEEN 65 AND 90
        OR
        ASCII(SUBSTRING(@Phrase, Tl.SequenceNumber, 1)) BETWEEN 97 AND 122);


PRINT @Phrase_LettersOnly;

Wow, that’s a lot of queries, so I’ll bring part one of this series to a close.  I look forward to seeing some other solutions in the comments below.  In part two, we’ll compare the performance of these methods across both a single string, and a whole table of them!

I’m Speaking – December 2017

After a relatively light speaking schedule in November – just PASS Summit this year – I’ll be back with a vengeance in December.  I’ve got three sessions lined up, all of them in venues in which I’ll be presenting for the first time!

I’ll be kicking the month off bright and early on Friday, December 1st with the winter edition of the GroupBy conference.  This is a virtual conference held over the first two Fridays in December, with ten sessions voted in by the community.  I was lucky enough to have my SQL Server R Services session selected this time around, and I hope to “see” you all there at 10:30 am EST (17:30 UTC).

I’ll be following that up with a trip to the nation’s capital the following week to present the same session at SQL Saturday Washington DC on Saturday, December 9th.  I was able to attend last year to see some great presentations, but this year the organizers have made the mistake of asking me to speak too!  I’m looking forward to following the excellent Kevin Feasel [b|t] once again as our sessions complement each other’s quite well.

For my final speaking slot of the year, I’ll be presenting virtually to the Orange County BIG PASS group headed up by my friend Mickey Stuewe [b|t].  I’ll be dusting off an older presentation on implementing row-based security through a single custom solution that works for both SSRS and SSAS.  If you’ll be in the Lake Forest area on Thursday, December 14th, please register on their Meetup group and join me for this one.

What a great way to close out another busy year!