
2016-01-02

Some SQL Server functions for string matching


Tags All Personal Tech

That rare thing... a work-related blog post, albeit a year after I actually wrote most of it. When I set paws to keyboard I'd recently ordered a copy of Peter Christen's Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection so it seemed a good time to have a look around for practical code samples that could be easily adapted to the environment I'm using, which is SQL Server. It's a good book, by the way, as well as serving to point out to me that the word "token" has a more specific meaning in data matching than general parlance and it'd be clearer if I stuck to "key" when explaining things to people.

There's a download at the end of the article with everything discussed compiled into one ready-to-use assembly with a highly original name of MatchingFunctions. If I make meaningful changes or fixes in future I'll revisit this entry, and if you find any particularly grievous errors then do let me know.

Regular Expressions

Regular expressions are a way of matching patterns in strings, and in my not-so-humble opinion, software that attempts to do anything meaningful with text but doesn't include (or can't easily have added in) a good regex implementation is basically a toy.

I generally use them more days than not for simple things like trimming whitespace and extra line breaks from blocks of text in a text editor (shoutout to Notepad++) but their versatility is what inspires religious zeal in adherents. Even something as simple as picking out the first distinct number in a sequence can be extremely useful if you need to do it more than a few times. More advanced regex tend to look like random characters interspersed with brackets, and they can be a bit tricky to debug just by visual inspection, but it's no exaggeration to say that they make practical many things that would otherwise not be. I'll probably write about using them in other applications such as Excel on other occasions, but if that might be of use to you (in Excel) right now then check out www.tmehta.com.
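To illustrate that "first number in a sequence" case outside of any particular editor, here's a minimal Python sketch (Python is just for illustration here; it isn't part of the download):

```python
import re

# Find the first run of digits in a string; None if there are no digits.
match = re.search(r'\d+', 'Ticket 42 raised, then merged into 9001')
first_number = match.group() if match else None
```

The same `\d+` pattern works in Notepad++ and most other regex-aware tools.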

Making regex goodness available to SQL Server was straightforward with the examples provided by:

https://www.simple-talk.com/sql/t-sql-programming/clr-assembly-regex-functions-for-sql-server-by-example/ (archive.org)

Phil Factor's post (and the siren call of regular expressions) was what got me interested in the common language runtime (CLR) with SQL Server. What you're doing with an assembly in SQL Server is embedding compiled code into a database, because it'll almost always execute faster than the equivalent code written as Transact-SQL. In this case, pattern matching much beyond the LIKE operator and PATINDEX function isn't really possible in plain T-SQL and it's better to turn elsewhere.

Various functions are given as examples, but I most often use RegExReplace because I'm looking to modify records in order to do crude cleansing of data on-the-fly. Note that the parameter order and options used with it are equivalent to those in SQL Server's own REPLACE function; RegExReplaceX is provided to fit with the other new functions. The main thing to remember if specifying options is that a value of 1 makes the functions that take options case-insensitive, assuming you're not wrapping input in UPPER or LOWER.


Here's a very simple (but useful) example:

SELECT dbo.RegExReplace('Get this 120 string 723782 minus any 
222888 non-numeric characters.','[^0-9]','')

And a more complicated one to pick a postcode out from a block of text, with the caveat that postcodes are a moving standard and it's usually better to go with a weaker match than to reject ones that might be valid:

SELECT dbo.RegExMatch('([A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2})',
'Miss S Pollard, 1 Chapel Hill, Heswall, BOURNEMOUTH, BH1 1AA is an address used by 
Royal Mail in examples',1)
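For anyone wanting to sanity-check that postcode pattern without a SQL Server instance to hand, here's the same search sketched in Python (purely illustrative; not part of the download):

```python
import re

# The same postcode pattern as used in the SQL example above.
POSTCODE = r'[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][ABD-HJLNP-UW-Z]{2}'

text = ('Miss S Pollard, 1 Chapel Hill, Heswall, BOURNEMOUTH, BH1 1AA '
        'is an address used by Royal Mail in examples')

match = re.search(POSTCODE, text)
postcode = match.group() if match else None
```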

Whilst regular expressions are incredibly useful in the context of data cleansing for doing things like removing non-alphanumeric characters, be careful not to remove information that may be relevant. For instance, given a string that contains a street address you might want to first replace characters that are treated as separators (such as commas and hyphens) with spaces.

SELECT dbo.RegExReplace(dbo.RegExReplace(ISNULL([address],''),
'[&:;,.@+-/\\]',' '), '[^0-9a-zA-Z ]','') AS cleaner_address
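The two-pass logic is easier to see outside of nested function calls; here's an illustrative Python sketch of the same cleanse (the `clean_address` name is mine, not something in the download):

```python
import re

def clean_address(address):
    # Pass 1: turn separator punctuation into spaces so tokens stay apart.
    spaced = re.sub(r'[&:;,.@+\-/\\]', ' ', address or '')
    # Pass 2: drop anything that still isn't alphanumeric or a space.
    return re.sub(r'[^0-9a-zA-Z ]', '', spaced)
```

Note that "Chapel-Hill" becomes "Chapel Hill" rather than "ChapelHill", which is usually what you want for later tokenising.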

Being CLR code this scales well to regularly dealing with hundreds of thousands of records, even on a low-spec box. Of course it isn't optimal to do so (it would be preferable to cleanse data at the source and at input thereafter) but it's rare to have the luxury to do things once and do them properly.

Double Metaphone

If you've ever done anything with data matching you've probably heard of Soundex, a phonetic algorithm developed about a century ago for encoding the approximate English pronunciation of words (chiefly names) to match similar-sounding words. It's available in Excel, SQL Server, and many other software products. Unfortunately, most implementations of Soundex rely on the first letter of the word being consistent. This is where Double Metaphone comes in, and it also tries to consider alternative pronunciations. If you're particularly interested in non-English languages you might like to look into work by genealogy researchers such as Beider-Morse Phonetic Matching.
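Soundex itself is only a few lines; here's a minimal sketch in Python of the classic American Soundex rules, handy as a baseline when comparing against Double Metaphone (illustrative only, and it skips some of the fussier prefix rules):

```python
# Letter-to-digit map for American Soundex; vowels, H, W and Y get no code.
_SOUNDEX = {**dict.fromkeys('BFPV', '1'), **dict.fromkeys('CGJKQSXZ', '2'),
            **dict.fromkeys('DT', '3'), 'L': '4',
            **dict.fromkeys('MN', '5'), 'R': '6'}

def soundex(name):
    name = name.upper()
    encoded = name[0]                     # first letter is kept verbatim
    prev = _SOUNDEX.get(name[0], '')
    for ch in name[1:]:
        code = _SOUNDEX.get(ch, '')
        if code and code != prev:         # collapse adjacent duplicate codes
            encoded += code
        if ch not in 'HW':                # H and W don't break adjacency
            prev = code
    return (encoded + '000')[:4]          # pad/truncate to four characters
```

As the main text says, the fixed first letter is the weak spot: 'Smith' and 'Smythe' both encode to S530, but a misheard first letter defeats it entirely.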

The Double Metaphone code I'm using is a 2005 implementation by Michael Coles, written when Microsoft's 2005 product lines were still in beta, based on the 2000 implementation by Lawrence Philips plus some additional work by Kevin Atkinson.

http://www.sqlservercentral.com/articles/.Net/doublemetaphonephonecticmatching/2063/ (archive.org)

It provides two functions of interest: DoubleMetaphoneCompare and DoubleMetaphoneEncode.

The first function returns an integer between 0 and 3, higher numbers being a stronger match because they involve more primary encoding matches. Note that you shouldn't give the function strings as input: it takes its own specially defined double metaphone result type, which encapsulates both metaphones in one object, i.e.

SELECT dbo.DoubleMetaphoneCompare(dbo.DoubleMetaphoneEncode('String1'),
dbo.DoubleMetaphoneEncode('String2'))
The second function exposes the phonetic approximations as individual strings, should you want to concentrate on the English pronunciation or just see what character encodings (up to four characters are used) come out of the sausage machine. Examples:

SELECT dbo.DoubleMetaphoneEncode('marketing').First => MRKT
SELECT dbo.DoubleMetaphoneEncode('advertising').First => ATFR

Because it's designed so that each letter of the encoding is roughly a syllable, Double Metaphone doesn't work well when comparing short words with longer variants that a human would intuitively consider extremely similar, such as the names 'smith' (SM0) and 'smithers' (SM0R). There's definitely some value in checking whether the first two or three metaphone characters are the same. You could simply compare the first few characters of the original strings instead, but phonetic approximations are worth considering and benchmarking for your applications.

There's a third iteration of Metaphone called Metaphone 3 which is commercially available from the same principal author who developed the other two, but in recent years has been integrated into a large Google project under a BSD licence. It appears the author didn't realise how permissive the licence in question is when approving the inclusion, and from the comments storm that erupted online in late December 2014 I'd expect to see more open implementations in future, particularly as a US patent application has been rejected.

Levenshtein Distance

Levenshtein distance is essentially the minimum number of single-character edits needed to turn one string into another, whether that be by deleting, adding or substituting characters. It lets you work around some typos and spelling variations when searching. For example, "smith" to "smythe" has an edit distance of 2 because you need to substitute a 'y' and add an 'e'.

As well as (and perhaps more usefully than) the raw number of edits, it can be useful to know how similar two strings are, which is often given as a percentage. In the previous example, "smith" is two thirds (about 67%) similar to "smythe", which makes it a reasonably good match.

Having encountered some very sluggish Levenshtein implementations in the past, I started by looking at an implementation that claimed to be optimised and doesn't appear to have received any significant valid criticism since being published about eight years ago.

http://www.codeproject.com/Articles/13525/Fast-memory-efficient-Levenshtein-algorithm (archive.org)

Sten Hjelmqvist's implementation gives a percentage difference, so I've just added a function to get the distance.
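The memory-efficient idea in that article, keeping only two rows of the edit matrix rather than the whole grid, can be sketched in Python along with a percentage similarity like the one discussed above (an illustrative re-implementation, not the code from the download):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits to turn a into b."""
    prev = list(range(len(b) + 1))        # edit distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]                        # distance from a[:i] to empty string
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr                       # only the last row is carried forward
    return prev[-1]

def similarity(a, b):
    """Similarity as a fraction of the longer string's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest
```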


Longest Common Substring and Subsequence

In reading around I also found the string algorithms below and adapted some of them, adding a few other similar or related functions. All are case sensitive and the word-based functions split on spaces.

http://en.wikibooks.org/wiki/Algorithm_Implementation/Strings (archive.org)
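As an indication of the kind of function involved, longest common substring by dynamic programming can be sketched like this in Python (illustrative only, and case sensitive like the CLR versions):

```python
def longest_common_substring(a, b):
    """Return the longest run of characters appearing in both a and b."""
    best, best_end = 0, 0
    prev = [0] * (len(b) + 1)             # lengths of common suffixes so far
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1  # extend the common suffix
                if curr[j] > best:
                    best, best_end = curr[j], i
        prev = curr
    return a[best_end - best:best_end]
```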


You might also find Phil Factor's work on non-CLR functions for finding common strings helpful; look back through his 2015 articles for more on this and on Levenshtein.

Compiling and installation

To recap, what you're doing with an assembly in SQL Server is embedding compiled code into a database. Specifically, we're compiling Visual Basic (VB) into dynamic link libraries (DLLs) and then converting the binary files to assembly bits that can be used in T-SQL without reference to the files. This means you can use the included SQL install script without having to do additional work, just by running it against a chosen database, but full source is provided for those who want to work through the intermediate stages.

Although Windows 7 includes the .NET Framework (version 3.5.1 and earlier) as an OS feature, to get it working and make a command-line compiler available you may need to fish around in Control Panel under "Turn Windows features on or off" if other software hasn't already enabled it. On my 32-bit system the various legacy versions are installed in subfolders of C:\Windows\Microsoft.NET\Framework, and since %windir% isn't indexed by default, just searching for name:vbc doesn't make this immediately obvious.

Personally I've opted to target 2.0 because that's the vintage of some of the code. One thing to note when compiling is that the filename of the output DLL is embedded by vbc.exe into the resultant file, so don't just use a generic output filename such as binary.dll, since SQL Server won't like having more than one assembly that identifies itself by the same name. I've included a very simple 'GetAssemblyBits' command line utility for encoding the DLL files to script-safe text that can be used in CREATE ASSEMBLY statements. As with the rest of the code, source is included in the download package.
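The conversion GetAssemblyBits performs is essentially turning the DLL's bytes into the 0x... hexadecimal literal that CREATE ASSEMBLY accepts. A sketch of the idea in Python rather than the VB source in the download (the `assembly_bits` name is mine):

```python
def assembly_bits(data):
    """Encode raw binary data as a 0x... hex literal for CREATE ASSEMBLY."""
    return '0x' + data.hex().upper()

# Typical usage would be something like:
#   bits = assembly_bits(open('MatchingFunctions.dll', 'rb').read())
#   then paste bits into: CREATE ASSEMBLY MatchingFunctions FROM <bits> ...
```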


Although I've used most of this code to do things myself, the usual disclaimer applies that none of this should be regarded as having any kind of warranty. What I've done is nothing more than light modification of code released by generous fellow geeks that's assumed (by virtue of publication on instructional blogs and comments) to be free to grok and abuse, and that abuse extends to having auto-converted some of it from C# to VB before making edits. I can't stress enough that you should test thoroughly if you're thinking of using any of this as the basis for anything important, and don't come crying if it borks your production server. Enjoy.


