{emayili}: Rudimentary Email Address Validation

[This article was first published on R | datawookie, and kindly contributed to R-bloggers.]

A recent issue on the {emayili} GitHub repository prompted me to think a bit more about email address validation. When I started looking into this I was somewhat surprised to learn that it’s such a complicated problem. Who would have thought that something as apparently simple as an email address could be linked with such complexity?

There are two separate problems:

  1. Is the email address valid or authentic? Does it work?
  2. Is the email address syntactically correct? Does it obey the rules?

We’ll be focusing on the second problem. But before we delve into that, let’s load {emayili}.

library(emayili)

Does It Work?

“If you want to validate an email address, send it an email. Problem solved.” (from the article “There’s only ONE way to validate an email address”)

The most effective means to answer this question is to send a test message to the address and potentially request a response. There are a few possible outcomes, some of which are:

  • The test message bounces. It’s rejected by the server, so the mailbox doesn’t exist.
  • The test message appears to be delivered but there’s no response. Perhaps the mailbox exists, but it’s unused or a throwaway.
  • The test message is delivered and there’s a response. It exists!

Is It Syntactically Correct?

Not surprisingly, there are rules which dictate what constitutes a syntactically correct email address. Before we think about the rules, let’s talk about the parts of an email address.

An email address consists of two parts, the local part and the domain, separated by an @ symbol. For example, in alice@example.com the local part is alice and the domain is example.com. An optional display name may precede the email address, in which case the email address is enclosed in angle brackets. For example, Alice Jones <alice@example.com>.
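
To make the anatomy concrete, here is a minimal base R sketch that splits an address into display name, local part and domain. The parse_address() helper is hypothetical and is not part of {emayili}. It also ignores quoted local parts and comments.

```r
# parse_address() is a hypothetical helper, not a function from {emayili}.
parse_address <- function(x) {
  # An address in angle brackets may be preceded by a display name.
  bracketed <- grepl("<[^<>]*>[[:space:]]*$", x)
  display <- ifelse(bracketed, trimws(sub("<[^<>]*>[[:space:]]*$", "", x)), NA_character_)
  email <- ifelse(bracketed, sub("^.*<([^<>]*)>[[:space:]]*$", "\\1", x), x)
  data.frame(
    # An empty display name is treated as missing.
    display = ifelse(!is.na(display) & nzchar(display), display, NA_character_),
    # The local part is everything before the last "@", the domain everything after it.
    local = sub("@[^@]*$", "", email),
    domain = sub("^.*@", "", email),
    stringsAsFactors = FALSE
  )
}

parse_address(c("alice@example.com", "Alice Jones <alice@example.com>"))
```

Both inputs yield the local part alice and the domain example.com. Only the second has a display name.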

There are various rules which apply to the local part and the domain, the most important (IMHO) of which are summarised below. Note: This is a high level summary and neglects some of the nuances.

Rules: Local Part

Rules for the local part of an email address:

  • May be up to 64 characters long.
  • May be quoted (like "alice") or unquoted (just alice).
  • If unquoted, may consist of the following ASCII characters:
      • lower and uppercase Latin letters (a to z and A to Z),
      • digits (0 to 9),
      • various printable punctuation marks and
      • dots (.), but not at the beginning or end, nor more than one in succession.
  • If quoted then the rules are a lot more liberal.
  • May contain comments (which are enclosed in parentheses).
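
The rules above can be approximated with a regular expression. This is only a rough sketch in base R: valid_local() is a hypothetical helper, not the check used by {emayili}. It ignores comments and accepts anything between double quotes as a quoted local part.

```r
# valid_local() is a hypothetical helper; it is not how {emayili} checks addresses.
valid_local <- function(local) {
  # Letters, digits and the punctuation permitted in an unquoted local part.
  atom <- "[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+"
  # One or more atoms separated by single dots: no leading, trailing or repeated dots.
  unquoted <- paste0("^", atom, "(\\.", atom, ")*$")
  # Quoted local parts are treated very liberally: anything between double quotes.
  quoted <- "^\".*\"$"
  nchar(local) <= 64 & (grepl(unquoted, local) | grepl(quoted, local))
}

valid_local(c("alice", "alice.jones", ".alice", "alice..jones", "\"alice jones\""))
# [1]  TRUE  TRUE FALSE FALSE  TRUE
```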

Rules: Domain

Rules for the domain of an email address:

  • May be up to 255 characters long.
  • Must satisfy the requirements of a hostname or IP address (in which case it must be enclosed in square brackets).
  • May contain comments (which are enclosed in parentheses).
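
A similar rough sketch for the domain, again in base R. valid_domain() is a hypothetical helper rather than anything from {emayili}. It ignores comments, only handles IPv4 literals and does not check label lengths or octet ranges.

```r
# valid_domain() is a hypothetical helper; it is not how {emayili} checks addresses.
valid_domain <- function(domain) {
  # A hostname label: letters, digits and hyphens, not starting or ending with a hyphen.
  label <- "[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?"
  hostname <- paste0("^", label, "(\\.", label, ")*$")
  # An IPv4 address enclosed in square brackets (octet ranges are not checked).
  ip_literal <- "^\\[[0-9]{1,3}(\\.[0-9]{1,3}){3}\\]$"
  nchar(domain) <= 255 & (grepl(hostname, domain) | grepl(ip_literal, domain))
}

valid_domain(c("example.com", "yahoo.co.uk", "hot_mail.com", "-example.com", "[192.168.0.1]"))
# [1]  TRUE  TRUE FALSE FALSE  TRUE
```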

The address Class

An address class has been added to {emayili}.

alice <- address("alice@example.com")
(bob <- address(email = "bob@yahoo.co.uk", display = "Robert Brown"))
[1] "Robert Brown <bob@yahoo.co.uk>"

You can also construct address objects from a local part and a domain.

address(local = "alice", domain = "example.com")
[1] "alice@example.com"

It’s vectorised and does recycling, so you can also do this sort of thing:

address(local = c("bob", "erin"), domain = "yahoo.co.uk")
[1] "bob@yahoo.co.uk"  "erin@yahoo.co.uk"

This also works well in a pipeline. First let’s set up a tibble with the details of a few email accounts.

library(tibble)

recipients <- tibble(
  email = c(NA, NA, NA, "oscar@windmill.nl"),
  local = c("alice", "erin", "bob", NA),
  domain = c("example.com", "yahoo.co.uk", "yahoo.co.uk", NA),
  display = c(NA, NA, "Robert Brown", NA)
)
recipients
# A tibble: 4 × 4
  email             local domain      display     
  <chr>             <chr> <chr>       <chr>       
1 <NA>              alice example.com <NA>        
2 <NA>              erin  yahoo.co.uk <NA>        
3 <NA>              bob   yahoo.co.uk Robert Brown
4 oscar@windmill.nl <NA>  <NA>        <NA>        

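Now use invoke() to call address() with the columns of the tibble as arguments. Because address() is vectorised, a single call produces an address for every record.

library(purrr)

# The outer parentheses print the result of the assignment.
(recipients <- recipients %>%
  invoke(address, .))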
[1] "alice@example.com"              "erin@yahoo.co.uk"              
[3] "Robert Brown <bob@yahoo.co.uk>" "oscar@windmill.nl"             

This could equally be done with do.call(). Note that invoke() has been deprecated since {purrr} 1.0.0 in favour of exec().
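
The reason this works is that a data frame is a list of columns, so each column is passed as a named argument in a single vectorised call. Here is a minimal sketch of the do.call() approach in base R. The make_address() function and the accounts data frame are hypothetical stand-ins for address() and the tibble, used only so that the example runs without {emayili}. With the real package the call would presumably be do.call(address, recipients).

```r
# make_address() is a hypothetical stand-in for address(), not part of {emayili}.
make_address <- function(email = NA, display = NA, local = NA, domain = NA) {
  # Build the address from local part and domain where no full address is given.
  email <- ifelse(is.na(email), paste0(local, "@", domain), email)
  # Prefix the display name where there is one.
  ifelse(is.na(display), email, paste0(display, " <", email, ">"))
}

accounts <- data.frame(
  email = c(NA, NA, NA, "oscar@windmill.nl"),
  local = c("alice", "erin", "bob", NA),
  domain = c("example.com", "yahoo.co.uk", "yahoo.co.uk", NA),
  display = c(NA, NA, "Robert Brown", NA),
  stringsAsFactors = FALSE
)

# Each column of the data frame becomes a named argument to make_address().
do.call(make_address, accounts)
```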

Email addresses can also be coerced into address objects using as.address().

as.address("Robert <bob@yahoo.co.uk>")
[1] "Robert <bob@yahoo.co.uk>"

Methods

There are methods for extracting the email address and display name.

raw(bob)
[1] "bob@yahoo.co.uk"
display(bob)
[1] "Robert Brown"

You can also get the local part and the domain.

local(bob)
[1] "bob"
domain(bob)
[1] "yahoo.co.uk"

Parties

There’s also a new function, parties(), for extracting the addresses associated with an email.

email <- envelope() %>%
  from("alice@example.com") %>%
  to("erin@yahoo.co.uk", "Robert <bob@yahoo.co.uk>") %>%
  cc("oscar@windmill.nl") %>%
  bcc("olivia@hotmail.com")

parties(email)
# A tibble: 5 × 6
  type                   address display raw                local  domain     
  <chr>               <vctrs_dd> <chr>   <chr>              <chr>  <chr>      
1 From         alice@example.com <NA>    alice@example.com  alice  example.com
2 To            erin@yahoo.co.uk <NA>    erin@yahoo.co.uk   erin   yahoo.co.uk
3 To    Robert <bob@yahoo.co.uk> Robert  bob@yahoo.co.uk    bob    yahoo.co.uk
4 Cc           oscar@windmill.nl <NA>    oscar@windmill.nl  oscar  windmill.nl
5 Bcc         olivia@hotmail.com <NA>    olivia@hotmail.com olivia hotmail.com

This gives the details of all of the addresses on the email, broken down into a tidy format.

Normalisation

Sometimes email address data can be dirty, so the address class tries to sanitise its contents.

as.address("       Robert       <   bob    @    yahoo.co.uk   >")
[1] "Robert <bob@yahoo.co.uk>"

This is very simple at the moment, but I’m planning on adding more functionality.
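
The kind of clean-up shown above can be sketched with a few regular expression substitutions in base R. The normalise_address() helper is hypothetical, and {emayili} may well implement its normalisation differently.

```r
# normalise_address() is a hypothetical helper; {emayili} may do this differently.
normalise_address <- function(x) {
  x <- trimws(x)                                  # Strip leading and trailing whitespace.
  x <- gsub("[[:space:]]*@[[:space:]]*", "@", x)  # No whitespace around "@".
  x <- gsub("<[[:space:]]+", "<", x)              # No whitespace after "<".
  x <- gsub("[[:space:]]+>", ">", x)              # No whitespace before ">".
  gsub("[[:space:]]+", " ", x)                    # Collapse remaining runs to one space.
}

normalise_address("       Robert       <   bob    @    yahoo.co.uk   >")
# [1] "Robert <bob@yahoo.co.uk>"
```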

Compliance

Finally, the compliant() function tests whether an email address complies with the syntax rules.

First some good addresses.

compliant(c(
  "alice@example.com",
  "Robert <bob@yahoo.co.uk>",
  "olivia@hotmail.com"
))
[1] TRUE TRUE TRUE

Now some evil addresses.

compliant(c(
  "alice?example.com",
  "Robert bob@yahoo.co.uk",
  "olivia@hot_mail.com"
))
[1] FALSE FALSE FALSE

The implementation of compliant() uses regular expressions. Take a look at this Stack Overflow thread for a lengthy discussion on the use of regular expressions for checking email addresses. The implementation is not perfect, but it’s functional. If you discover cases where it fails, please let me know and I’ll improve the logic.

Conclusion

For the most part you won’t need to worry about these checks. They will all happen in the background when you put together an email. If, however, one of your email addresses is problematic, then you should at least know about it before you send the email.

These updates are available in {emayili}-0.4.16.
