A recent issue on the {emayili} GitHub repository prompted me to think a bit more about email address validation. When I started looking into this I was somewhat surprised to learn that it’s such a complicated problem. Who would have thought that something as apparently simple as an email address could be linked with such complexity?
There are two separate problems:
- Is the email address valid or authentic? Does it work?
- Is the email address syntactically correct? Does it obey the rules?
We’ll be focusing on the second problem. But before we delve into that, let’s load {emayili}.
library(emayili)
Does It Work?
If you want to validate an email address, send it an email. Problem solved. There’s only ONE way to validate an email address
The most effective means to answer this question is to send a test message to the address and potentially request a response. There are a few possible outcomes, some of which are:
- The test message bounces. It’s rejected by the server, so the mailbox presumably doesn’t exist.
- The test message appears to be delivered but there’s no response. Perhaps the mailbox exists, but it’s unused or a throwaway.
- The test message is delivered and there’s a response. It exists!
Is It Syntactically Correct?
Not surprisingly, there are rules which dictate what constitutes a syntactically correct email address. Before we think about the rules, let’s talk about the parts of an email address.
An email address consists of two parts, the local part and the domain, separated by an @ symbol. For example, alice@example.com. An optional display name may precede the email address, in which case the email address is enclosed in angle brackets. For example, Alice Jones <alice@example.com>.
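To make the anatomy concrete, here is a minimal base R sketch that pulls an address apart into display name, local part and domain. The helper `split_address()` is hypothetical (it is not part of {emayili}) and it assumes each input contains an @ and, at most, one trailing pair of angle brackets.

```r
# Hypothetical helper (not part of {emayili}): split an address string into
# display name, local part and domain using base R only.
split_address <- function(x) {
  x <- trimws(x)
  # Does the string look like "Display Name <local@domain>"?
  bracketed <- grepl("^.*<[^<>]*>$", x)
  # Everything before the angle brackets is the display name (if any).
  display <- ifelse(bracketed, trimws(sub("<[^<>]*>$", "", x)), NA_character_)
  display[!is.na(display) & display == ""] <- NA_character_
  # The bare email is either the bracketed part or the whole string.
  email <- ifelse(bracketed, sub("^.*<([^<>]*)>$", "\\1", x), x)
  # Split at the last "@" (assumes there is one).
  local <- sub("^(.*)@[^@]*$", "\\1", email)
  domain <- sub("^.*@([^@]*)$", "\\1", email)
  data.frame(display, local, domain, stringsAsFactors = FALSE)
}

split_address(c("alice@example.com", "Alice Jones <alice@example.com>"))
```

Both inputs give a local part of alice and a domain of example.com. Only the second has a display name.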
There are various rules which apply to the local part and the domain, the most important (IMHO) of which are summarised below. Note: This is a high level summary and neglects some of the nuances.
Rules: Local Part
Rules for the local part of an email address:
- May be up to 64 characters long.
- May be quoted (like "alice") or unquoted (just alice).
- If unquoted may consist of the following ASCII characters:
  - lower and uppercase Latin letters (a to z and A to Z)
  - digits (0 to 9)
  - various printable punctuation marks and
  - dots (.) but not at the beginning or end or more than one in succession.
- If quoted then the rules are a lot more liberal.
- May contain comments (which are enclosed in parentheses).
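The rules for an unquoted local part translate fairly directly into a regular expression. What follows is a rough sketch rather than the check that {emayili} performs: `local_ok()` is a hypothetical helper, the set of punctuation marks is my reading of the commonly cited list, and quoted local parts and comments are deliberately ignored.

```r
# Characters allowed in an unquoted local part, excluding the dot
# (assumption: letters, digits and the usual printable punctuation marks).
atext <- "[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
# One or more runs of allowed characters, separated by single dots. This rules
# out a leading dot, a trailing dot and consecutive dots.
dot_atom <- paste0("^", atext, "+(\\.", atext, "+)*$")

# Hypothetical helper: TRUE where an unquoted local part obeys the rules.
local_ok <- function(local) {
  !is.na(local) & nchar(local) <= 64 & grepl(dot_atom, local, perl = TRUE)
}

local_ok(c("alice", "alice.jones", ".alice", "alice.", "al..ice", "al ice"))
# [1]  TRUE  TRUE FALSE FALSE FALSE FALSE
```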
Rules: Domain
Rules for the domain of an email address:
- May be up to 255 characters long.
- Must satisfy the requirements of a hostname or IP address (in which case it must be enclosed in square brackets).
- May contain comments (which are enclosed in parentheses).
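A similar sketch works for the domain. Again this is an illustration under stated assumptions, not the package's logic: `domain_ok()` is a hypothetical helper, hostname labels are taken to be 1 to 63 letters, digits or hyphens with no hyphen at either end, only IPv4 literals are recognised (without range-checking the octets) and comments are not handled.

```r
# A hostname label: starts and ends with a letter or digit, hyphens allowed
# in between, at most 63 characters.
label <- "[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
# A hostname is one or more labels separated by dots.
hostname <- paste0("^", label, "(\\.", label, ")*$")
# An IPv4 address in square brackets (octet ranges are not checked).
ip_literal <- "^\\[[0-9]{1,3}(\\.[0-9]{1,3}){3}\\]$"

# Hypothetical helper: TRUE where a domain obeys the rules.
domain_ok <- function(domain) {
  !is.na(domain) & nchar(domain) <= 255 &
    (grepl(hostname, domain, perl = TRUE) | grepl(ip_literal, domain, perl = TRUE))
}

domain_ok(c("example.com", "yahoo.co.uk", "[192.168.0.1]",
            "hot_mail.com", "-example.com", "example..com"))
# [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE
```

The underscore in hot_mail.com is what trips it up, since underscores are not permitted in hostname labels.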
The address Class
An address class has been added to emayili.
alice <- address("alice@example.com")
(bob <- address(email = "bob@yahoo.co.uk", display = "Robert Brown"))
[1] "Robert Brown <bob@yahoo.co.uk>"
You can also construct address objects from a local part and a domain.
address(local = "alice", domain = "example.com")
[1] "alice@example.com"
It’s vectorised and does recycling, so you can also do this sort of thing:
address(local = c("bob", "erin"), domain = "yahoo.co.uk")
[1] "bob@yahoo.co.uk"  "erin@yahoo.co.uk"
This also works well in a pipeline. First let’s set up a tibble with the details of a few email accounts.
library(tibble)

recipients <- tibble(
  email = c(NA, NA, NA, "oscar@windmill.nl"),
  local = c("alice", "erin", "bob", NA),
  domain = c("example.com", "yahoo.co.uk", "yahoo.co.uk", NA),
  display = c(NA, NA, "Robert Brown", NA)
)

recipients
# A tibble: 4 × 4
  email             local domain      display
  <chr>             <chr> <chr>       <chr>
1 <NA>              alice example.com <NA>
2 <NA>              erin  yahoo.co.uk <NA>
3 <NA>              bob   yahoo.co.uk Robert Brown
4 oscar@windmill.nl <NA>  <NA>        <NA>
Now use invoke() to call address() with the columns of the tibble as arguments.
library(purrr)

recipients <- recipients %>% invoke(address, .)
[1] "alice@example.com"              "erin@yahoo.co.uk"
[3] "Robert Brown <bob@yahoo.co.uk>" "oscar@windmill.nl"
This could equally be done with do.call().
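Here is a rough illustration of the do.call() route. So that it runs without {emayili}, it uses a hypothetical stand-in, `make_address()`, which only mimics the behaviour shown above, and a plain data frame in place of the tibble. With the package loaded, the analogous call would presumably be do.call(address, recipients).

```r
# Hypothetical stand-in for address(), written in base R. It is NOT the
# package's implementation; it just reproduces the output shown above.
make_address <- function(email = NA, local = NA, domain = NA, display = NA) {
  # Build the address from its parts where no complete email is supplied.
  email <- ifelse(is.na(email), paste0(local, "@", domain), email)
  # Prepend the display name where there is one.
  ifelse(is.na(display), email, paste0(display, " <", email, ">"))
}

recipients <- data.frame(
  email = c(NA, NA, NA, "oscar@windmill.nl"),
  local = c("alice", "erin", "bob", NA),
  domain = c("example.com", "yahoo.co.uk", "yahoo.co.uk", NA),
  display = c(NA, NA, "Robert Brown", NA),
  stringsAsFactors = FALSE
)

# A data frame is a list of columns, so each column is passed as a named
# argument to the function.
do.call(make_address, recipients)
```

As far as I know, recent releases of purrr deprecate invoke() in favour of exec(), so the base R approach has the advantage of not depending on either.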
Email addresses can also be coerced into address objects using as.address().
as.address("Robert <bob@yahoo.co.uk>")
[1] "Robert <bob@yahoo.co.uk>"
Methods
There are methods for extracting the email address and display name.
raw(bob)
[1] "bob@yahoo.co.uk"
display(bob)
[1] "Robert Brown"
You can also get the local part and the domain.
local(bob)
[1] "bob"
domain(bob)
[1] "yahoo.co.uk"
Parties
There’s also a new function, parties(), for extracting the addresses associated with an email.
email <- envelope() %>%
  from("alice@example.com") %>%
  to("erin@yahoo.co.uk", "Robert <bob@yahoo.co.uk>") %>%
  cc("oscar@windmill.nl") %>%
  bcc("olivia@hotmail.com")

parties(email)
# A tibble: 5 × 6
  type  address                  display raw                local  domain
  <chr> <vctrs_dd>               <chr>   <chr>              <chr>  <chr>
1 From  alice@example.com        <NA>    alice@example.com  alice  example.com
2 To    erin@yahoo.co.uk         <NA>    erin@yahoo.co.uk   erin   yahoo.co.uk
3 To    Robert <bob@yahoo.co.uk> Robert  bob@yahoo.co.uk    bob    yahoo.co.uk
4 Cc    oscar@windmill.nl        <NA>    oscar@windmill.nl  oscar  windmill.nl
5 Bcc   olivia@hotmail.com       <NA>    olivia@hotmail.com olivia hotmail.com
This gives the details of all of the addresses on the email, broken down in a nice tidy format.
Normalisation
Sometimes email address data can be dirty, so the address class tries to sanitise its contents.
as.address(" Robert < bob @ yahoo.co.uk >")
[1] "Robert <bob@yahoo.co.uk>"
This is very simple at the moment, but I’m planning on adding more functionality.
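To give a feel for the kind of clean-up involved, here is a minimal base R sketch of whitespace normalisation. The helper `normalise_address()` is hypothetical and is not how {emayili} does it. It also assumes that the display name doesn’t itself contain an @ surrounded by spaces, since those spaces would be removed too.

```r
# Hypothetical helper (not {emayili}'s implementation): tidy stray whitespace
# in an address string.
normalise_address <- function(x) {
  # Drop leading and trailing whitespace.
  x <- trimws(x)
  # Collapse runs of whitespace into a single space.
  x <- gsub("\\s+", " ", x)
  # Remove whitespace just inside the angle brackets.
  x <- gsub("<\\s+", "<", x)
  x <- gsub("\\s+>", ">", x)
  # Remove whitespace around the "@".
  gsub("\\s*@\\s*", "@", x)
}

normalise_address("   Robert   <  bob @ yahoo.co.uk   >")
# [1] "Robert <bob@yahoo.co.uk>"
```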
Compliance
Finally, a test of whether an email address complies with the syntax rules.
First some good addresses.
compliant(c(
  "alice@example.com",
  "Robert <bob@yahoo.co.uk>",
  "olivia@hotmail.com"
))
[1] TRUE TRUE TRUE
Now some evil addresses.
compliant(c(
  "alice?example.com",
  "Robert bob@yahoo.co.uk",
  "olivia@hot_mail.com"
))
[1] FALSE FALSE FALSE
The implementation of compliant() uses regular expressions. Take a look at this StackOverflow thread for a lengthy discussion on the use of regular expressions for checking email addresses. The implementation is not perfect, but it’s functional. If you discover cases where it fails, please let me know and I’ll improve the logic.
Conclusion
For the most part you won’t need to worry about these checks. They will all happen in the background when you put together an email. If, however, one of your email addresses is problematic, then you should at least know about it before you send the email.
These updates are available in {emayili}-0.4.16.