In my last post, I mentioned that I started the process of upgrading Stack Overflow to SQL Server 2019. This week I tackled our first production servers and after upgrading, we hit a small issue aka a gotcha because we were using an old system view. Below is a recap of what I encountered.
A Little Background
The servers I upgraded were the three SQL Servers that run Stack Overflow for Teams. For those not familiar with our setup, we use availability groups (AGs) across all of our main SQL Servers. This allows us to use the primary server for read/writes, and use the secondaries for a lot of the read-only traffic. Since we utilize both the primary and secondaries for read purposes, logins need to be the same across all of the servers in the AGs. In order to keep the logins in sync, we have a job that periodically runs on each server and creates new logins using dynamic SQL.
Upgrade Day
Since I previously upgraded our servers to SQL Server 2017, I knew what to expect as I moved through the cluster. Once I started with the secondary servers the databases would not be accessible until the failover was complete. After upgrading a secondary, I would see this if I tried to access a database:
This also meant that any of the jobs touching the databases would fail, so I disabled SQL Agent jobs to minimize noisy alerts to my email. After the upgrade and failover, I’d enable all the jobs and be on my merry way.
The upgrade of both secondary servers went totally as planned, without any issues. The evening rolled around and despite Nick Craver’s popcorn, the failover was successful, and we finally had SQL Server 2019 running in production.
I upgraded the last server in the cluster (the former primary), and did some other clean-up tasks, including enabling all the SQL Agent jobs, before calling it a night. Once the jobs were enabled I started getting emails that one was failing — the Login Replication job.
Something’s broken - time to investigate.
The Failing Job
I pulled up the history on the job and saw the error:
Message
Executed as user: <username>. Invalid value given for parameter PASSWORD. Specify a valid parameter value. [SQLSTATE 42000] (Error 15021). The step failed.
Sigh ok, something is really broken because this was working before we failed over.
The code for the login replication basically does the following via a cursor (yeah, I know, but it works…normally):
- Select from the primary via
OPENQUERY
to query the logins and passwords - Using
sp_hexadecimal
convert thevarbinary
password to astring
value - Create a string to be executed, i.e. dynamic SQL that runs a
CREATE LOGIN
I have trimmed the code because it’s long, but the key parts query the login and password.
|
|
The value of pwd_varbinary
and sid
are then used with sp_hexadecimal
:
|
|
Finally, we concatenate them into a SQL string that we execute:
|
|
When I ran the SELECT
statement, I noticed the issue — the value of pwd_varbinary
was null
which obviously was wrong, and since we were passing a null
as the password to sp_hexadecimal
the job was failing.
Great, so now what?
The Fix
We couldn’t just stop replicating the logins across all servers, due to our usage of the secondaries, and we needed to do this automatically on a regular interval. We had to figure out a solution. Thankfully the fix was easy.
After seeing that the null
value of the password was coming from sys.syslogins
, and since we were already querying sys.sql_logins
we could easily replace:
|
|
from sys.syslogins
with
|
|
from sys.sql_logins
(h/t to Nick Craver). With this one change to the query, the new SELECT
statement was:
|
|
Once it was updated, the job successfully ran, and our logins were replicating again without issue. Now, I really called it a night.
Ok, but what was the gotcha?
Remember I said that this was working fine on SQL Server 2017?
When we started looking for solutions, I looked up the Microsoft Docs for sys.syslogins
and right at the top of the doc it says:
This SQL Server 2000 system table is included as a view for backward compatibility. We recommend that you use the current SQL Server system views instead. To find the equivalent system view or views, see Mapping System Tables to System Views (Transact-SQL). This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature.
The doc shows the description and value of password
:
Column | Data type | Description |
---|---|---|
password | nvarchar(128) | Returns NULL. |
Hmm, that’s weird because the password
column contains an actual value in SQL Server 2017. Something obviously changed. Yes, we were risky because we were using an old view, but the important thing is the value in the password
column changed, and it bit us.
In SQL Server 2017, there is still a value for password
in sys.syslogins
, but in SQL Server 2019 it is now null
.
If this was mentioned somewhere, I missed it. Thankfully, this wasn’t a critical job otherwise it could have been more problematic. We don’t have the Login Replication job running in development, so we didn’t hit the issue until we moved to production.
Technically it’s on us because we were using an old system view and didn’t check for changes, but keep in mind if you’re using the sys.syslogins
view anywhere, and rely on the password
, you’ll need to make a code update before moving to SQL Server 2019.
If you’ve upgraded to SQL Server 2019, have you hit any of these issues yet?