Friday, June 04, 2010

TSQL for extracting email address from text field

I was working on a project which involved data migration and I needed a way to extract email addresses out of the notes field.  So here is what I came up with.

Create FUNCTION svf_ExtractEmailAddress 
(
 @notes varchar(1000), @incident int /*1 to get the first occurance, 2 to get the second, etc*/
)
RETURNS varchar(100)
AS
BEGIN
 declare @result varchar(100), @iStart int, @iMid int, @iEnd int, @temp varchar(1000), @incidentCounter int
 
 --replace newline (10/13) and tab(9) with space
 set @notes = REPLACE(REPLACE(REPLACE(@notes,CHAR(13),' '), char(10), ' '), char(9), ' ')
 set @incidentCounter=1
 
 set @iMid = patindex('%[A-Z1-9]@[A-Z1-9]%', @notes) --the location of the character righth before the @ sign
 
 while (@incidentCounter < @incident and @iMid > 0)
 Begin  
  set @notes = substring(@notes, @iMid +1, 1000)
  set @iMid = patindex('%[A-Z1-9]@[A-Z1-9]%', @notes) --the location of the character righth before the @ sign
  set @incidentCounter = @incidentCounter+1
 End
 
 if (@iMid =0) return ''
 
 --locate the first space before the at symbol
 set @temp = substring(@notes,0, @iMid )-- temp contains everything BEFORE the @sign
 set @iStart = charindex(' ', @temp)
 set @temp = substring(@notes,@iStart, @iMid-@iStart )
 while (charindex(' ', @temp)>0) 
 Begin
  set @iStart = @iStart + charindex(' ', @temp)
  set @temp = substring(@notes,@iStart, @iMid-@iStart)
 End
 --locate the first space after the @ symbol
 set @temp = substring(@notes,@iMid, 1000 ) -- temp contains everything AFTER the @sign
 set @iEnd = patindex('%[ ,:;'+CHAR(13)+CHAR(10)+']%', @temp)
 set @iEnd = 
  case when @iEnd=0 then 1000 
  else @iMid + @iEnd -1 end
 /*
 set @result = '@iMid: ' + convert(varchar(9), @iMid) 
  + ' @iStart: ' + convert(varchar(9), @iStart) 
  + ' @iEnd: ' + convert(varchar(9), @iEnd) 
  + ' email: ' + substring (@notes, @istart, @iEnd - @istart)
  --*/
 set @result = substring (@notes, @istart, @iEnd - @istart)
 RETURN  @result 
END

Example usage:

declare @note varchar(1000) set @note= 'bird@gmail.com hello happy chipmunk@yahoo.com, you are nice guy so dog@gmail.com cat@gmail.com'
select e1=dbo.svf_ExtractEmailAddress(@note, 1), e2=dbo.svf_ExtractEmailAddress(@note, 2),
 e3=dbo.svf_ExtractEmailAddress(@note, 3), e4=dbo.svf_ExtractEmailAddress(@note, 4), 
 e5=dbo.svf_ExtractEmailAddress(@note, 5)

And here is the result:

5 comments:

Anonymous said...

This is failing for me when I run the following:

DECLARE @txt VARCHAR(1000)

SELECT @txt = 'I am html text with an email address somewhere... email'

SELECT email = dbo.svf_ExtractEmailAddress(@txt, 1)

Unknown said...

Line 36 actually makes the proc think that 'n' should be treated as a blank so 'abc@MSN.com' will return 'abs@MS'. I removed the '/n' because shouldn't line 11 already take care of that? Either way this is a great script that I will be saving. Thanks a lot

Unknown said...
This comment has been removed by the author.
Paradigm said...

I updated the stored proc. I assumed new lines woulbe be delimited like this \n, but that isn't the case. instead i am using char(10) and char(13) now.

good catch. I hope it's useful to you.

M1gu3l_Villegas said...

Thank you. Worked fine for me.