Who is Who? Matching taxonomic names

whoiswhoimageIf you have ever used more than one database describing information for a list of species, you probably have struggled with the “joys” of finding the right name matches. I work mostly with mammalian databases (I have compiled a list of many of these useful resources here) and despite being generally well-know species I always run into problems when matching names. There are reputable taxonomic lists like Wilson and Reeder’s but this was published in 2005 and new species have been discovered and some names revised, so not all sources use the same names. Particularly, I often want to match names to IUCN Red List data and their taxonomic list changes regularly. I think their goal is to be up-to-date rather than being taxonomically precise… So after many wasted hours matching databases, often with manual searches, I decided to make a R script with the help of Luis Verde Arregoitia to make the process as automated as possible. We are both sharing the code in our blogs in case it is useful for others too.

The script shows how to match the IUCN Red List status data with a species-level database. Here as an example we use the diet database EltonTraits as an example, but the script could be modified relatively easy to match other databases. You can download a R Markdown file here.

March 2017 UPDATE: Depending on when you download your data names  from the IUCN status data and the IUCN spatial data may not match. This means a matched taxonomy to status may not be good for the distribution range areas. Just something else to keep in mind.

FIRST: we start by loading and retrieving the necessary R packages and datasets. Using the R package taxize you can actually retrieve the IUCN Red List data directly but you need a personal API so I am not showing that here. You can also download the latest version from IUCN Red List. For this example I just uploaded a list obtained on the 5th January 2017 which is available on GitHub.

library(taxize)
library(stringdist)

## IUCN Red List (download and tidy-up)
IUCN <- read.csv("https://raw.githubusercontent.com/ManuelaGonzalez/Who-is-Who/master/Mammals_2017_01_05.csv", stringsAsFactors=FALSE,strip.white=TRUE)
## Generate full species name
IUCN$binomial <- paste(IUCN$Genus, IUCN$Species, sep = " ")
## Tidy-up synonyms list
IUCN$Synonyms <- gsub("  ", " ", IUCN$Synonyms)
IUCN$Synonyms <- gsub("<i>", "", IUCN$Synonyms)
IUCN$Synonyms <- gsub("</i>", "", IUCN$Synonyms)

## EltonTrait database (download and tidy-up)
EltonTraits_original <- read.table("http://www.esapubs.org/archive/ecol/E095/178/MamFuncDat.txt", sep="\t", stringsAsFactors=FALSE, header=T, strip.white=TRUE)
## There are some empty rows in the downloadable file of EltonTraits, these should be removed
EltonTraits <- EltonTraits_original[EltonTraits_original$Scientific!="",]
str(EltonTraits)

SECOND: Although each species is meant to have only one name under our system of nomenclatural rules, there are often several names that are applied under different circumstances. Generally, only one taxa (the oldest) is designated as a priority by the International Commission on Zoological Nomenclature, other names are considered “synonymous” (junior synonym). To add to the confusion, the status of subspecies can be very dynamic. 

The IUCN database contains a list of synomys but even using that list some species may not be matched. The R package taxize allows to search some online sources to define synonyms but still some names may be unmatched. Luis Verde Arregoitia used web scraping to download and wrangle a massive list of synonyms for species and subspecies of mammals, compiled by Christian Boudet for his website Planet Mammifères (Mammals’ Planet). This is not necessarily a formal, peer-reviewed taxonomic database but it is quite up-to-date and the synonyms listed reflect taxonomic changes and the current status of homotypic synonyms. The list of synonyms on the site appears as text with a new line for each species and its synonyms. Because there is consistency in the formatting (species in boldface, synonyms separated by equal signs), the text can be wrangled into a tidy data frame relatively easily.

## web scraping the list of synonyms
library(rvest)
library(stringi)
library(tidyr)
library(dplyr)

## scrape page
fullPage <- read_html("http://www.planet-mammiferes.org/drupal/en/node/37?taxon=1")

## extract node (identified using the selectorgadget browser extension)
## note that the scraped hmtl is saved as a string
synList <- fullPage %>% html_nodes("#main p") %>% toString()

## split strings using the newline html tags, store as matrix
## replace "pattern" in the stri_split_fixed function with a line break html tag
#### (< then b then > with no spaces)... the WordPress MCE editor keeps stripping
###### it from the code block
synListMat <-   stri_split_fixed(synList, "pattern") %>%
  stri_list2matrix() %>%  as.data.frame()

## split into columns (probably more than needed, just being cautious)
## the warning message can be ignored
synSepDF <- separate(synListMat,V1,into = paste("V", 1:9,sep=""), sep = "=") ## which columss are empty synSepDF %>%  summarise_each(funs(100*mean(is.na(.))))
## remove them
synSepDF <-  synSepDF %>% select(V1:V5)

## clean up the first column (species), based on the tags for boldface
synSepDF$V1 <-   stri_replace_all_fixed(synSepDF$V1,c("<strong>","</strong>"),"",vectorize_all = FALSE)

## clean up the rows that don't have species names (still html tags in the cells)
## make sure the regex doesn't need escaping (depending on your OS)
synTable <- synSepDF %>%  filter(!stri_detect_regex(V1,'\\<'))

## rename columns
synTable <- synTable %>% rename(species=V1,syn1=V2,syn2=V3,syn3=V4,syn4=V5)

## species and subspecies
## count words
synTable$ssp <- stri_count_words(synTable$species)
## recode
synTable$subsp <- case_when(synTable$ssp == 2 ~ "species",
                            synTable$ssp == 3 ~ "subspecies")
## clean up
synTable <- synTable %>% select(-ssp)
## trim whitespace
Mammal_planet_all <- synTable %>% mutate_all(stri_trim_both)
## subset and prepare for next step
## remove last variable that separates taxonomic level
Mammal_planet <- subset(Mammal_planet_all, subsp=="species")[,1:5] 

## write to disk (optional)
# write.csv(synTable,file="mammalPLanet.csv")

THIRD: this is the loop to search and match species names from EltonTraits to IUCN. It is a presented as a single loop with several steps that could be done separately (that could help for checking mistakes).

## Create some variables to store information or for checks within loopEltonTraits$IUCN_binomial="no_match"
EltonTraits$IUCN_binomial="no_match"
EltonTraits$IUCN_binomial_issues=""
EltonTraits$IUCN_binomial_source=""
f1="test"
maxDist = 5 ## This value influences the allowed mistmatched in partial matches made with 'amatch' from the library stringdist, can be changed to allow for more or less variation in spelling. 

for (i in 1: nrow(EltonTraits)){
  ##STEP 1: matches IUCN listed species name or synonym to names in my database of interest (in this example EltonTraits)
  if (length(grep(EltonTraits$Scientific[i], IUCN$binomial))>0){
    EltonTraits$IUCN_binomial[i]=IUCN$binomial[grep(EltonTraits$Scientific[i], IUCN$binomial)]
    EltonTraits$IUCN_binomial_source[i]="IUCN"}
  if ((EltonTraits$IUCN_binomial[i]=="no_match") & (length(grep(EltonTraits$Scientific[i], IUCN$Synonyms))==1)) {
    EltonTraits$IUCN_binomial[i]=IUCN$binomial[grep(EltonTraits$Scientific[i], IUCN$Synonyms)]
    EltonTraits$IUCN_binomial_source[i]="IUCN"}
  if ((EltonTraits$IUCN_binomial[i]=="no_match") & (length(grep(EltonTraits$Scientific[i], IUCN$Synonyms))>1)) {
    EltonTraits$IUCN_binomial[i]="multiple_matches"
    EltonTraits$IUCN_binomial_issues[i]=paste(IUCN$binomial[grep(EltonTraits$Scientific[i], IUCN$Synonyms)], collapse = ';')
    EltonTraits$IUCN_binomial_source[i]="IUCN"}
  ##also try partial matches to account for possible mispellings and variants. Names would be listed as partial_match to ensure manual checking as partial matches may not be taxonomically correct.
  if ((EltonTraits$IUCN_binomial[i]=="no_match") & (!is.na(amatch(EltonTraits$Scientific[i], IUCN$binomial, maxDist = maxDist)))) {
    EltonTraits$IUCN_binomial[i]="partial_match"
    EltonTraits$IUCN_binomial_issues[i]=paste(IUCN$binomial[amatch(EltonTraits$Scientific[i], IUCN$binomial, maxDist = maxDist)], collapse = ';')
    EltonTraits$IUCN_binomial_source[i]="IUCN"}
  if ((EltonTraits$IUCN_binomial[i]=="no_match") & (!is.na(amatch(EltonTraits$Scientific[i], IUCN$Synonyms, maxDist = maxDist)))){
    EltonTraits$IUCN_binomial[i]="partial_match"
    EltonTraits$IUCN_binomial_issues[i]=paste(IUCN$binomial[amatch(EltonTraits$Scientific[i], IUCN$Synonyms, maxDist = maxDist)], paste(IUCN$Synonyms[amatch(EltonTraits$Scientific[i], IUCN$Synonyms, maxDist = maxDist)], collapse = ';'), sep = ";")
    EltonTraits$IUCN_binomial_source[i]="IUCN"}
  
  ##STEP 2: matches names not recognized in step 1 with the EoL, Encyclopedia of Life database to identify additional synomyms and possible matches
  if ((EltonTraits$IUCN_binomial[i]=="no_match") | (EltonTraits$IUCN_binomial[i]=="partial_match")) {
    EOL_synonym <- eol_search(EltonTraits$Scientific[i])
    if (!is.na(EOL_synonym[[1]][1])){
      for (j in 1:nrow(EOL_synonym)){
        f2 <-f1
        f1 <-trimws(paste(strsplit(EOL_synonym[j,2], " ")[[1]][1],strsplit(EOL_synonym[j,2], " ")[[1]][2], " ")) ##listed names are often repeated so this is to avoid rematching the same name
        f1 <- gsub("\\(|\\)", " ", f1) ##removing problems with invalid symbols in the synonym list
        if (f1!=f2){           
          if (length(grep(f1, IUCN$binomial))>0) {
            if (EltonTraits$IUCN_binomial[i]=="partial_match"){
              EltonTraits$IUCN_binomial_issues[i]=""
              EltonTraits$IUCN_binomial[i]=IUCN$binomial[grep(f1, IUCN$binomial)]
              EltonTraits$IUCN_binomial_source[i]="EOL-Taxize"}
            if ((EltonTraits$IUCN_binomial[i]!="no_match") & (EltonTraits$IUCN_binomial[i]!=IUCN$binomial[grep(f1, IUCN$binomial)])) {
              if (EltonTraits$IUCN_binomial_source[i]!="EOL-Taxize"){
                EltonTraits$IUCN_binomial[i]="multiple_matches"
                EltonTraits$IUCN_binomial_source[i]= paste(EltonTraits$IUCN_binomial_source[i], "EOL-Taxize", sep = ";")}
              if (EltonTraits$IUCN_binomial_issues[i]==""){
                EltonTraits$IUCN_binomial_issues[i]=paste(EltonTraits$IUCN_binomial[i], paste(IUCN$binomial[grep(f1, IUCN$binomial)], collapse = ';'), sep = ";")
                EltonTraits$IUCN_binomial[i]="multiple_matches"}
              if (EltonTraits$IUCN_binomial_issues[i]!="") {
                EltonTraits$IUCN_binomial_issues[i]=paste(EltonTraits$IUCN_binomial_issues[i], paste(IUCN$binomial[grep(f1, IUCN$binomial)],collapse = ';'), sep = ";")
                EltonTraits$IUCN_binomial[i]="multiple_matches"}
            }
            if (EltonTraits$IUCN_binomial[i]=="no_match"){
              EltonTraits$IUCN_binomial[i]=IUCN$binomial[grep(f1, IUCN$binomial)]
              EltonTraits$IUCN_binomial_source[i]="EOL-Taxize"}
          }
          if (length(grep(f1, IUCN$Synonyms))==1){
            if (EltonTraits$IUCN_binomial[i]=="partial_match"){
              EltonTraits$IUCN_binomial_issues[i]=""
              EltonTraits$IUCN_binomial[i]=IUCN$binomial[grep(f1, IUCN$Synonyms)]
              EltonTraits$IUCN_binomial_source[i]="EOL-Taxize"}
            if ((EltonTraits$IUCN_binomial[i]!="no_match") & (EltonTraits$IUCN_binomial[i]!=IUCN$binomial[grep(f1, IUCN$Synonyms)])) {
              if (EltonTraits$IUCN_binomial_source[i]!="EOL-Taxize"){
                EltonTraits$IUCN_binomial_source[i]= paste(EltonTraits$IUCN_binomial_source[i], "EOL-Taxize", sep = ";")
                EltonTraits$IUCN_binomial[i]="multiple_matches"}
              if (EltonTraits$IUCN_binomial_issues[i]==""){
                EltonTraits$IUCN_binomial_issues[i]=paste(EltonTraits$IUCN_binomial[i], paste(IUCN$binomial[grep(f1, IUCN$Synonyms)],collapse = ';'), sep = ";")
                EltonTraits$IUCN_binomial[i]="multiple_matches"}
              if (EltonTraits$IUCN_binomial_issues[i]!="") {
                EltonTraits$IUCN_binomial_issues[i]=paste(EltonTraits$IUCN_binomial_issues[i], paste(IUCN$binomial[grep(f1, IUCN$Synonyms)],collapse = ';'), sep = ";")
                EltonTraits$IUCN_binomial[i]="multiple_matches"}
            }
            if (EltonTraits$IUCN_binomial[i]=="no_match"){
              EltonTraits$IUCN_binomial[i]=IUCN$binomial[grep(f1, IUCN$Synonyms)]
              EltonTraits$IUCN_binomial_source[i]="EOL-Taxize" }
          }
          if ((length(grep(f1, IUCN$Synonyms))>1) & (EltonTraits$IUCN_binomial[i]=="no_match")){
            EltonTraits$IUCN_binomial[i]="multiple_matches"
            EltonTraits$IUCN_binomial_source[i]="EOL-Taxize"
            EltonTraits$IUCN_binomial_issues[i]=paste(IUCN$binomial[grep(f1, IUCN$Synonyms)],collapse = ';')
          }
          if ((length(grep(f1, IUCN$Synonyms))>1) & (EltonTraits$IUCN_binomial[i]!="no_match")){
            EltonTraits$IUCN_binomial[i]="multiple_matches"
            EltonTraits$IUCN_binomial_issues[i]=paste(EltonTraits$IUCN_binomial_issues[i], paste(IUCN$binomial[grep(f1, IUCN$Synonyms)],collapse = ';'), sep = ";")
            if (EltonTraits$IUCN_binomial[i]!="partial_match"){
              EltonTraits$IUCN_binomial_source[i]="EOL-Taxize"}
            if (EltonTraits$IUCN_binomial_source[i]=="IUCN"){
              EltonTraits$IUCN_binomial_source[i]="IUCN-partial_EOL-Taxize"}
          }
        }
      }
    }
  }
  
  ##STEP 3: matches names not recognized in step 2 with the synomym list from website Mammal Planet
  if ((EltonTraits$IUCN_binomial[i]=="no_match") | (EltonTraits$IUCN_binomial[i]=="partial_match")){
    for (m in 1:ncol(Mammal_planet)){
      if (length(grep(EltonTraits$Scientific[i], Mammal_planet[,m]))>0){
        syno = Mammal_planet[grep(EltonTraits$Scientific[i], Mammal_planet[,m]),-m]
        for(z in 1:ncol(syno)){
          if ((syno[z])!="") {
            if (length(grep(syno[z],IUCN$binomial))>0) {
              if (EltonTraits$IUCN_binomial[i]=="partial_match") {
                EltonTraits$IUCN_binomial_issues[i]=paste("partial_IUCN_match", EltonTraits$IUCN_binomial_issues[i],sep = ";")
                EltonTraits$IUCN_binomial[i]=IUCN$binomial[grep(syno[z],IUCN$binomial)]
                EltonTraits$IUCN_binomial_source[i]="Mammal_Planet_Website"}
              if ((EltonTraits$IUCN_binomial[i]!="no_match") & (EltonTraits$IUCN_binomial[i]!=IUCN$binomial[grep(syno[z],IUCN$binomial)])){
                EltonTraits$IUCN_binomial_issues[i]=paste(EltonTraits$IUCN_binomial[i],paste(IUCN$binomial[grep(syno[z],IUCN$binomial)],collapse = ';'),sep = ";")
                EltonTraits$IUCN_binomial[i]="multiple_matches"
                if (EltonTraits$IUCN_binomial_source[i]!="Mammal_Planet_Website"){
                  EltonTraits$IUCN_binomial_source[i]=paste( EltonTraits$IUCN_binomial_source[i], "Mammal_Planet_Website",sep = ";")}
              }
              if (EltonTraits$IUCN_binomial[i]=="no_match"){
                EltonTraits$IUCN_binomial[i]=IUCN$binomial[grep(syno[z],IUCN$binomial)]
                EltonTraits$IUCN_binomial_source[i]="Mammal_Planet_Website"}
            }
            if (length(grep(syno[z],IUCN$Synonyms))>0) {
              if (EltonTraits$IUCN_binomial[i]=="partial_match") {
                EltonTraits$IUCN_binomial_issues[i]=paste("partial_IUCN_match", EltonTraits$IUCN_binomial_issues[i],sep = ";")
                EltonTraits$IUCN_binomial[i]=IUCN$binomial[grep(syno[z],IUCN$Synonyms)]
                EltonTraits$IUCN_binomial_source[i]="Mammal_Planet_Website"}
              if ((EltonTraits$IUCN_binomial[i]!="no_match") & (EltonTraits$IUCN_binomial[i]!=IUCN$binomial[grep(syno[z],IUCN$Synonyms)])) {
                EltonTraits$IUCN_binomial_issues[i]=paste(EltonTraits$IUCN_binomial[i],paste(IUCN$binomial[grep(syno[z],IUCN$Synonyms)],collapse = ';'),sep = ";")
                EltonTraits$IUCN_binomial[i]="multiple_matches"
                if (EltonTraits$IUCN_binomial_source[i]!="Mammal_Planet_Website"){
                  EltonTraits$IUCN_binomial_source[i]=paste( EltonTraits$IUCN_binomial_source[i], "Mammal_Planet_Website",sep = ";")}}
              if (EltonTraits$IUCN_binomial[i]=="no_match"){
                EltonTraits$IUCN_binomial[i]=IUCN$binomial[grep(syno[z],IUCN$Synonyms)]
                EltonTraits$IUCN_binomial_source[i]="Mammal_Planet_Website"}
            }
          }
        }
      }
    }
  }
}

FOURTH: quick summary of issues. Some species may be only partially matched or matched to more than one name. In those cases we see no solution by human intervention. That means, you need to check those taxa and make a decision (to assign one name, to ignore these potential confusion species, to write a paper to clarify taxonomy…)

EltonTraits %>% filter(IUCN_binomial %in% c("partial_match","multiple_matches")) %>%
count(IUCN_binomial)

The R script available on GitHub  has a possible STEP 4 that could be add to the loop or run afterwards to match names with ITIS, Integrated Taxonomic Information System. This step was not helpful to match these example databases but may be useful for others.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s