Data analysis made simple

While roaming around looking for data to explore I came across this dataset in Kaggle website. The data set contains information about the animals admitted in the shelter and the purpose is to predict their outcome.

But before we get to that let’s explore the files and get to know the features.

Warm up

To read the csv files we need the library readr

library("readr")

If you don’t have the library available you need to install it

Now let’s read the files and have a look at what we have.

animals <- read_csv(file="train.csv")
head(animals)

##   AnimalID    Name            DateTime     OutcomeType OutcomeSubtype
## 1  A671945 Hambone 2014-02-12 18:22:00 Return_to_owner           <NA>
## 2  A656520   Emily 2013-10-13 12:44:00      Euthanasia      Suffering
## 3  A686464  Pearce 2015-01-31 12:28:00        Adoption         Foster
## 4  A683430    <NA> 2014-07-11 19:09:00        Transfer        Partner
## 5  A667013    <NA> 2013-11-15 12:52:00        Transfer        Partner
## 6  A677334    Elsa 2014-04-25 13:04:00        Transfer        Partner
##   AnimalType SexuponOutcome AgeuponOutcome
## 1        Dog  Neutered Male         1 year
## 2        Cat  Spayed Female         1 year
## 3        Dog  Neutered Male        2 years
## 4        Cat    Intact Male        3 weeks
## 5        Dog  Neutered Male        2 years
## 6        Dog  Intact Female        1 month
##                               Breed       Color
## 1             Shetland Sheepdog Mix Brown/White
## 2            Domestic Shorthair Mix Cream Tabby
## 3                      Pit Bull Mix  Blue/White
## 4            Domestic Shorthair Mix  Blue Cream
## 5       Lhasa Apso/Miniature Poodle         Tan
## 6 Cairn Terrier/Chihuahua Shorthair   Black/Tan

We can also check the dimension of the dataset
26729, 10 Some processing to convert some columns to factors, since we have many of them we’ll use the magic lapply.

factors <- c("OutcomeType","OutcomeSubtype", "AnimalType","AgeuponOutcome","SexuponOutcome","Breed","Color")
animals[factors] <- lapply(animals[factors],FUN =as.factor)

Know your data

I will proceed with some explorations to get to know the kind of information the dataset possesses.
Summary is a very useful to check basic information about the data frame. It also shows that we hve some NA, “Other”, “Unknown” values which might be a problem to get relevant statistical results and machine learning models.

That’s why I will start by some data observation and processign when needed to have the table more harmonised.

Age

The first observation is tht the age is expressed in various “units”:

levels(animals$AgeuponOutcome)

##  [1] "0 years"   "10 months" "10 years"  "11 months" "11 years" 
##  [6] "12 years"  "13 years"  "14 years"  "15 years"  "16 years" 
## [11] "17 years"  "18 years"  "19 years"  "1 day"     "1 month"  
## [16] "1 week"    "1 weeks"   "1 year"    "20 years"  "2 days"   
## [21] "2 months"  "2 weeks"   "2 years"   "3 days"    "3 months" 
## [26] "3 weeks"   "3 years"   "4 days"    "4 months"  "4 weeks"  
## [31] "4 years"   "5 days"    "5 months"  "5 weeks"   "5 years"  
## [36] "6 days"    "6 months"  "6 years"   "7 months"  "7 years"  
## [41] "8 months"  "8 years"   "9 months"  "9 years"

I counted 44 units, in order to be able to use this information it should be expressed with the same unit, I chose the smallest unit existing which is “day”. for each row I will apply a transformation by converting “week”,”month”,”year”,and “day”, First the function is defined

library(stringr)
convertAge <- function(age){
  # first extract the digits
  regexp <- "[[:digit:]]+"
  result <- 0
  digits <- strtoi(str_extract(age, regexp))
  if (grepl("day",age)){
    result <- digits
  } else if(grepl("week",age)) {
    result <- digits*7
  } else if (grepl("month",age)) {
    result <- digits*30
  } else if (grepl("year",age)) {
    result <- digits*365
  }
  return(result)
}

then I apply the conversion function to each row of the column age

 animals$AgeuponOutcome <- sapply(animals$AgeuponOutcome,convertAge)

Dogs vs Cats

Here we will explore the correlation between the fate of the animal and its type.

center

From the plot, it seems that the animal type impacts somehow the outcome. To make sure this sample is not skewed by dominance of one type over another let’s check first the distribution

## 
##       Cat       Dog 
## 0.4165513 0.5834487

The distribution is not perfectly balanced because Dogs represent 58% of the animals. The outcome depends on the animal type

##      
##        Adoption      Died Euthanasia Return_to_owner  Transfer
##   Cat 0.3966942 0.7461929  0.4565916       0.1044714 0.5842709
##   Dog 0.6033058 0.2538071  0.5434084       0.8955286 0.4157291

Male vs Female

In this part we are more interested in the gender, which in this case seems to be divided in 4 types:

## [1] "Intact Female" "Intact Male"   "Neutered Male" "Spayed Female"
## [5] "Unknown"

And “Unknown”, that can be any of the other 4.

center

##                
##                   Adoption       Died Euthanasia Return_to_owner  Transfer
##   Intact Female 0.01885040 0.28426396 0.25787781     0.062904911 0.2706432
##   Intact Male   0.01467174 0.40101523 0.30675241     0.099686520 0.2477181
##   Neutered Male 0.48491039 0.09644670 0.22122186     0.469592476 0.2066440
##   Spayed Female 0.48156746 0.09137056 0.14919614     0.365308255 0.1736362
##   Unknown       0.00000000 0.12690355 0.06495177     0.002507837 0.1013585

The challenge is to fill the unknown with the right values : female (spayed, or intact), male (neutered, or intact), we will use basic knowledge as well as the other features.

The first thing to try is to infer the gender based on color using the following golden rule:
*For genetical reasons, only females are calico, which means they have three colors (white, orange and black), they can happen to be male, but this means they have a genetic anomaly (XXY chromosomes), but I won’t go that far.
Some numbers to prove my point:

table(animals$SexuponOutcome[animals["Color"] == "Calico"])

## 
## Intact Female   Intact Male Neutered Male Spayed Female       Unknown 
##           198             3             1           296            19

when I display the count of cats that are “Calico” per gender, out of more than 400 cats, only 4 of them are male. Therefore the assumption that the 19 unknown are female is not harm the statistics. Now the main problem remains what kind of female ? Intact or spayed ?

At first, I can derive some intuition: the animals are born intact, and are spayed/neutered at some point in their life which should not happen before some age, for example a cat who is 1 week is too young to be spayed and vice versa, an old cat is more likely to have been spayed already, let’s plot age as a function of gender to verify this theory.

center

Until the age of 30 days, the neutered/spayed animals are inexistent, which makes sense from a scientific point o view because the animals are too young. The threshold I will use is 30.
The following conclusion is drawn The Calico cats under the age of 30 days are all intact females

Let’s apply it

animals$SexuponOutcome[animals$AgeuponOutcome <=30 & animals$SexuponOutcome == "Unknown"] <- "Intact Female"

The other part of data with unknown gender could be useful to predict the outcome of the animal

table(animals$OutcomeType[animals["SexuponOutcome"] == "Unknown"])

## 
##        Adoption            Died      Euthanasia Return_to_owner 
##               0               8              57              10 
##        Transfer 
##             322

For example, animal of unknown type will never be adopted.

Feature engineering

Now let’s move on to create new features.

HasName

To simply the processing of name, the characters themseves are not useful in the context of learning and outcome prediction. However the presence is important. It means most of the time that the animal belonged to somebody how it a certain name.

animals$hasName <- sapply(animals$Name,FUN = function(x) !is.na(x))
animals$hasName <- factor(animals$hasName)

That’s all for today. See you next time!

Ciao!