I Knew How To Validate An Email Address Until I Read The RFC

Created 09/12/2007

Raise your hand if you know how to validate an email address. For those of you with your hand in the air, put it down quickly before someone sees you. It’s an odd sight to see someone sitting alone at the keyboard raising his or her hand. I was speaking metaphorically.

Before yesterday I would have raised my hand (metaphorically) as well. I needed to validate an email address on the server. Something I’ve done a hundred thousand times (seriously, I counted) using a handy dandy regular expression in my personal library.

This time, for some reason, I decided to take a look at my underlying assumptions. I had never actually read (or even skimmed) the RFC for an email address. I simply based my implementation on my preconceived assumptions about what makes a valid email address. You know what they say about assuming.

What I found out was surprising. Nearly 100% of regular expressions on the web purporting to validate an email address are too strict.

It turns out that the local part of an email address, the part before the @ sign, allows a lot more characters than you’d expect. According to section 2.3.10 of RFC 2821, which defines SMTP, the local part (the part after the @ sign being the host domain) is only intended to be interpreted by the receiving host...

Consequently, and due to a long history of problems when intermediate hosts have attempted to optimize transport by modifying them, the local-part MUST be interpreted and assigned semantics only by the host specified in the domain part of the address.

Section 3.4.1 of RFC 2822 goes into more detail about the specification of an email address (emphasis mine).

An addr-spec is a specific Internet identifier that contains a locally interpreted string followed by the at-sign character ("@", ASCII value 64) followed by an Internet domain.  The locally interpreted string is either a quoted-string or a dot-atom.

A dot-atom is a dot-delimited series of atoms. An atom is defined in section 3.2.4 as a series of alphanumeric characters and may include the following characters (all the ones you need to swear in a comic strip)...

! $ & * - = ^ ` | ~ # % ' + / ? _ { }
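To make the dot-atom rule concrete, here is a minimal C# sketch that checks only the unquoted form of a local part. The class name DotAtomSketch and the method IsDotAtom are hypothetical names made up for illustration, and the sketch assumes ASCII input and ignores quoted strings entirely.

```csharp
using System;
using System.Linq;

public static class DotAtomSketch
{
    // Non-alphanumeric characters an atom may contain, copied from the list above.
    private const string Specials = "!$&*-=^`|~#%'+/?_{}";

    // Hypothetical helper: true when localPart is one or more
    // non-empty atoms joined by single dots.
    public static bool IsDotAtom(string localPart)
    {
        if (string.IsNullOrEmpty(localPart)) return false;

        // A leading, trailing, or doubled dot produces an empty atom, which is rejected.
        return localPart.Split('.').All(atom =>
            atom.Length > 0 &&
            atom.All(c => IsAsciiLetterOrDigit(c) || Specials.IndexOf(c) >= 0));
    }

    private static bool IsAsciiLetterOrDigit(char c) =>
        (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
}
```

With this sketch, IsDotAtom("!def!xyz%abc") and IsDotAtom("customer/department=shipping") return true, while IsDotAtom("a..b") and IsDotAtom("Ima Fool") return false.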

Not only that, but it’s also valid (though not recommended and very uncommon) to have quoted local parts which allow pretty much any character. Quoting can be done via the backslash character (what is commonly known as escaping) or via surrounding the local part in double quotes.
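As a rough illustration of the double-quoted form, the following C# sketch walks a quoted local part character by character. QuotedLocalPartSketch and IsQuotedLocalPart are hypothetical names, and the sketch makes a simplifying assumption: inside the quotes, a backslash may only escape a double quote, another backslash, or a carriage return, which is narrower than what the RFC grammar permits.

```csharp
using System;

public static class QuotedLocalPartSketch
{
    // Hypothetical helper: true when localPart is wrapped in double quotes and
    // every quote, backslash, or carriage return inside is backslash-escaped.
    public static bool IsQuotedLocalPart(string localPart)
    {
        if (localPart == null || localPart.Length < 2) return false;
        if (localPart[0] != '"' || localPart[localPart.Length - 1] != '"') return false;

        for (int i = 1; i < localPart.Length - 1; i++)
        {
            char c = localPart[i];
            if (c == '\\')
            {
                // Move to the escaped character; it must not be the closing quote.
                i++;
                if (i >= localPart.Length - 1) return false;

                // Assumption: only these three characters may be escaped.
                char next = localPart[i];
                if (next != '"' && next != '\\' && next != '\r') return false;
            }
            else if (c == '"' || c == '\r')
            {
                // An unescaped quote or carriage return is not allowed inside the quotes.
                return false;
            }
        }
        return true;
    }
}
```

Under these assumptions, the local part "Ima Fool" (including its quotes) is accepted, while "test"blah" is rejected because the inner quote is not escaped.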

RFC 3696, Application Techniques for Checking and Transformation of Names, was written by the author of the SMTP protocol (RFC 2821) as a human-readable guide to SMTP. In section 3, he gives some examples of valid email addresses.

These are all valid email addresses!

Note: Gotta love the author for using my favorite example person, Joe Blow.

Quick, run these through your favorite email validation method. Do they all pass?

For fun, I decided to try and write a regular expression (yes, I know I now have two problems. Thanks.) that would validate all of these. Here’s what I came up with. (The part in bold is the local part. I am not worrying about checking my assumptions for the domain part for now.)

^(?!\.)("([^"\r\\]|\\["\r\\])*"|([-a-z0-9!#$%&'*+/=?^_`{|}~]|(?<!\.)\.)*)(?<!\.)@[a-z0-9][\w\.-]*[a-z0-9]\.[a-z][a-z\.]*[a-z]$

Note that this expression assumes case insensitivity options are turned on (RegexOptions.IgnoreCase for .NET). Yeah, that’s a pretty ugly expression.

I wrote a unit test to demonstrate all the cases this test covers. Each row below is an email address and whether it should be valid or not.

[RowTest]
[Row(@"NotAnEmail", false)]
[Row(@"@NotAnEmail", false)]
[Row(@"""test\\blah""@example.com", true)]
[Row(@"""test\blah""@example.com", false)]
[Row("\"test\\\rblah\"@example.com", true)]
[Row("\"test\rblah\"@example.com", false)]
[Row(@"""test\""blah""@example.com", true)]
[Row(@"""test""blah""@example.com", false)]
[Row(@"customer/[email protected]", true)]
[Row(@"[email protected]", true)]
[Row(@"!def!xyz%[email protected]", true)]
[Row(@"[email protected]", true)]
[Row(@"[email protected]", true)]
[Row(@"[email protected]", false)]
[Row(@"[email protected]", false)]
[Row(@"[email protected]", false)]
[Row(@"[email protected]", false)]
[Row(@"""[email protected]""@example.com", true)]
[Row(@"[email protected]", true)]
[Row(@"""Ima.Fool""@example.com", true)]
[Row(@"""Ima Fool""@example.com", true)]
[Row(@"Ima [email protected]", false)]
public void EmailTests(string email, bool expected)
{
    string pattern = @"^(?!\.)(""([^""\r\\]|\\[""\r\\])*""|"
        + @"([-a-z0-9!#$%&'*+/=?^_`{|}~]|(?<!\.)\.)*)(?<!\.)"
        + @"@[a-z0-9][\w\.-]*[a-z0-9]\.[a-z][a-z\.]*[a-z]$";

    Regex regex = new Regex(pattern, RegexOptions.IgnoreCase);
    Assert.AreEqual(expected, regex.IsMatch(email),
        "Problem with '" + email + "'. Expected " + expected + " but was not that.");
}

Before you call me a completely anal nitpicky numnut (you might be right, but wait anyways), I don’t think this level of detail in email validation is absolutely necessary. Most email providers have stricter rules than are required for email addresses. For example, Yahoo requires that an email start with a letter. There seems to be a standard stricter set of rules most email providers follow, but as far as I can tell it is undocumented.

I think I’ll sign up for an email address like phil.h\@\@[email protected] and start bitching at sites that require emails but don’t let me create an account with this new email address. Ooooooh I’m such a troublemaker.

The lesson here is that it is healthy to challenge your preconceptions and assumptions once in a while and to never let me near an RFC.

UPDATES: Corrected some mistakes I made in reading the RFC. See! Even after reading the RFC I still don’t know what the hell I’m doing! Just goes to show that programmers can’t read. I updated the post to point to RFC 822 as well. The original RFC.

Source: http://haacked.com/archive/(...)-address-until-i.aspx 
