Data analysis made simple
While roaming around looking for data to explore, I came across this dataset on the Kaggle website. The dataset contains information about the animals admitted to a shelter, and the purpose is to predict their outcome.
But before we get to that let’s explore the files and get to know the features.
Warm up
To read the CSV files we need the readr library.
If you don’t have the library available, you need to install it first.
Now let’s read the files and have a look at what we have.
We can also check the dimension of the dataset
26729, 10
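As a minimal sketch of the reading step: the file name and columns below are hypothetical stand-ins (a tiny CSV written to a temporary path), not the actual Kaggle file, so the dimensions differ from the ones above.

```r
library(readr)

# Hypothetical stand-in for the real file: a tiny CSV in a temp location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "AnimalID,Name,AnimalType,AgeuponOutcome,OutcomeType",
  "A001,Rex,Dog,1 year,Adoption",
  "A002,,Cat,3 weeks,Transfer",
  "A003,Milo,Cat,2 months,Adoption"
), csv_path)

# read_csv() returns a tibble; empty fields become NA by default.
# col_types = cols() lets readr guess the types without printing a message.
animals <- read_csv(csv_path, col_types = cols())

# dim() returns the number of rows, then the number of columns.
dim(animals)
```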
Next, some processing to convert several columns to factors. Since there are many of them, we’ll use the magic of lapply.
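Here is a small sketch of that conversion on a toy data frame. The column names and the choice of which columns to convert are assumptions for illustration.

```r
# Toy data; column names are assumed, not taken from the real file.
animals <- data.frame(
  Name = c("Rex", NA, "Milo"),
  AnimalType = c("Dog", "Cat", "Cat"),
  OutcomeType = c("Adoption", "Transfer", "Adoption"),
  stringsAsFactors = FALSE
)

# Columns we want as factors (Name stays a character column).
factor_cols <- c("AnimalType", "OutcomeType")

# lapply() applies factor() to each selected column and returns a list,
# which is assigned back into the same columns.
animals[factor_cols] <- lapply(animals[factor_cols], factor)

str(animals)
```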
Know your data
I will proceed with some explorations to get to know the kind of information the dataset possesses.
Summary is a very useful function for checking basic information about the data frame.
It also shows that we have some NA, “Other” and “Unknown” values, which might be a problem for getting relevant statistical results and machine learning models.
That’s why I will start with some data observation, and processing when needed, to make the table more harmonised.
Age
The first observation is that the age is expressed in various “units”:
I counted 44 of them. To be able to use this information, it should be expressed in a single unit, so I chose the smallest existing one, which is “day”. For each row I will apply a transformation that converts “week”, “month”, “year” and “day” values into days. First the function is defined,
then I apply the conversion function to each row of the age column.
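One possible shape for such a conversion is sketched below. It assumes the age strings look like "3 weeks" or "1 year", and it approximates a month as 30 days and a year as 365 days. Both the function name and those factors are my assumptions.

```r
# Convert a string such as "3 weeks" into a number of days.
# Assumption: a month is 30 days and a year is 365 days.
age_to_days <- function(age) {
  if (is.na(age) || age == "") return(NA_real_)
  parts <- strsplit(age, " ", fixed = TRUE)[[1]]
  value <- as.numeric(parts[1])
  unit <- sub("s$", "", parts[2])  # drop the plural: "weeks" -> "week"
  days_per_unit <- c(day = 1, week = 7, month = 30, year = 365)
  unname(value * days_per_unit[unit])
}

# Apply the function to every element of a (toy) age column.
ages <- c("1 year", "3 weeks", "2 months", "5 days", NA)
age_days <- vapply(ages, age_to_days, numeric(1), USE.NAMES = FALSE)
age_days
```

An unrecognised unit yields NA rather than an error, which keeps the bad rows easy to spot afterwards.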
Dogs vs Cats
Here we will explore the correlation between the fate of the animal and its type.
From the plot, it seems that the animal type somehow affects the outcome. To make sure this sample is not skewed by the dominance of one type over the other, let’s first check the distribution.
The distribution is not perfectly balanced: dogs represent 58% of the animals. Even so, the outcome appears to depend on the animal type.
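A quick way to run this kind of check is with table() and prop.table(). The sketch below uses made-up rows and assumed column names, so the proportions are not those of the real dataset.

```r
# Toy data; values and column names are for illustration only.
animals <- data.frame(
  AnimalType = c("Dog", "Dog", "Dog", "Cat", "Cat"),
  OutcomeType = c("Adoption", "Return_to_owner", "Adoption",
                  "Transfer", "Adoption"),
  stringsAsFactors = FALSE
)

# Share of each animal type in the sample.
type_share <- prop.table(table(animals$AnimalType))

# Outcome proportions within each animal type (each row sums to 1).
outcome_by_type <- prop.table(
  table(animals$AnimalType, animals$OutcomeType),
  margin = 1
)

type_share
outcome_by_type
```

Normalising per row makes the two types comparable even when one of them dominates the sample.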
Male vs Female
In this part we are more interested in the gender, which in this case seems to be divided into 4 types:
And “Unknown”, which can be any of the 4 above.
The challenge is to fill in the unknowns with the right values: female (spayed or intact) or male (neutered or intact). We will use basic knowledge as well as the other features.
The first thing to try is to infer the gender based on color using the following golden rule:
For genetic reasons, calico cats, which have three colors (white, orange and black), are almost always female. A calico can happen to be male, but this means it has a genetic anomaly (XXY chromosomes), and I won’t go that far.
Some numbers to prove my point:
When I display the count of “Calico” cats per gender, out of more than 400 cats only 4 are male. Therefore, assuming that the 19 unknown ones are female does not harm the statistics. The main problem that remains is: what kind of female? Intact or spayed?
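A count like this can be obtained by filtering on the color and tabulating the gender. The sketch below uses toy rows and assumed column names (Color, SexuponOutcome), so its counts are not the ones quoted above.

```r
# Toy data; column names and values are assumptions.
cats <- data.frame(
  Color = c("Calico", "Calico/White", "Black", "Calico", "Orange Tabby"),
  SexuponOutcome = c("Spayed Female", "Intact Female", "Neutered Male",
                     "Unknown", "Intact Male"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose color mentions "Calico".
is_calico <- grepl("Calico", cats$Color, fixed = TRUE)

# Number of calico cats per gender category.
calico_by_sex <- table(cats$SexuponOutcome[is_calico])
calico_by_sex
```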
To start, I can derive some intuition: animals are born intact and are spayed/neutered at some point in their life, which should not happen before a certain age. For example, a cat that is 1 week old is too young to be spayed; conversely, an old cat is more likely to have been spayed already. Let’s plot age as a function of gender to verify this theory.
Until the age of 30 days, neutered/spayed animals are nonexistent, which makes sense from a scientific point of view because the animals are too young. The threshold I will use is 30.
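Besides plotting, the same question can be checked numerically by looking at the youngest animal in each gender category. This is only a sketch on toy data; AgeDays and SexuponOutcome are assumed column names.

```r
# Toy data; ages are in days and entirely made up.
animals <- data.frame(
  SexuponOutcome = c("Intact Female", "Intact Male", "Spayed Female",
                     "Neutered Male", "Intact Female", "Spayed Female"),
  AgeDays = c(7, 21, 60, 90, 365, 730),
  stringsAsFactors = FALSE
)

# Minimum age observed for each gender category.
min_age_by_sex <- tapply(animals$AgeDays, animals$SexuponOutcome,
                         min, na.rm = TRUE)

# Youngest spayed or neutered animal in the sample.
fixed <- grepl("Spayed|Neutered", names(min_age_by_sex))
youngest_fixed <- min(min_age_by_sex[fixed])
youngest_fixed
```

For the visual version, boxplot(AgeDays ~ SexuponOutcome, data = animals) gives one box per category.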
The following conclusion is drawn:
Calico cats under the age of 30 days are all intact females.
Let’s apply it
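A minimal sketch of applying the rule, on toy data with assumed column names: only rows that are calico, of unknown gender, and younger than 30 days are relabelled.

```r
# Toy data; column names and values are assumptions.
cats <- data.frame(
  Color = c("Calico", "Calico", "Black", "Calico"),
  SexuponOutcome = c("Unknown", "Unknown", "Unknown", "Spayed Female"),
  AgeDays = c(14, 400, 10, 21),
  stringsAsFactors = FALSE
)

# Rows matching all three conditions of the rule.
young_unknown_calico <- grepl("Calico", cats$Color, fixed = TRUE) &
  cats$SexuponOutcome == "Unknown" &
  !is.na(cats$AgeDays) & cats$AgeDays < 30

# Fill in the inferred gender; every other row is left untouched.
cats$SexuponOutcome[young_unknown_calico] <- "Intact Female"
cats
```

If the gender column is stored as a factor, "Intact Female" has to be one of its levels for this assignment to work.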
The rest of the data with unknown gender could still be useful for predicting the outcome of the animal.
For example, animals of unknown gender will never be adopted.
Feature engineering
Now let’s move on to create new features.
HasName
To simplify the processing of the name: the characters themselves are not useful in the context of learning and outcome prediction. However, the presence of a name is important. Most of the time it means that the animal belonged to somebody who gave it a name.
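One way to build such a feature is a logical column that is TRUE when a name is present. The sketch assumes the column is called Name and that a missing name shows up as NA or as an empty string.

```r
# Toy data; a missing name may be NA or "".
animals <- data.frame(
  Name = c("Rex", NA, "", "Milo"),
  stringsAsFactors = FALSE
)

# TRUE when the animal has a non-empty name, FALSE otherwise.
animals$HasName <- !is.na(animals$Name) & animals$Name != ""
animals
```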
That’s all for today. See you next time!
Ciao!