Data is the lifeblood of machine learning, especially deep learning. This article introduces the IMDB-WIKI dataset, which can be used to learn the task of estimating age and gender from facial images. It covers everything up to shaping the data for training. Next time, I would like to train a CNN for age / gender estimation.
The code is below. https://github.com/yu4u/age-gender-estimation
This dataset was created by crawling the Internet Movie Database (IMDb; an online database of actors in movies and TV shows) and Wikipedia. It consists of profile images, images of the facial areas extracted from the profile images, and metadata about the people. IMDb contributes 460,723 facial images and Wikipedia contributes 62,328. The winning team in the ChaLearn apparent age estimation challenge, a competition to estimate age from images, used this dataset for pre-training (followed by fine-tuning on the competition training data).
The original images are very large, so download the archives that contain only the extracted facial areas and the metadata.
#IMDb data
wget https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/static/imdb_crop.tar
#Wikipedia data
wget https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/static/wiki_crop.tar
All the metadata is stored in the .mat file (wiki.mat for Wikipedia) in the above archive. Although it is Matlab data, it can be read with scipy as follows.
import scipy.io
meta = scipy.io.loadmat("wiki.mat")
The data included is as follows.
-- dob: Date of birth
-- photo_taken: The year the photo was taken
-- full_path: Image path
-- gender: Gender
-- name: Name
-- face_location: Rectangle information of face area
-- face_score: Face detection score
-- second_face_score: Second largest face detection score
These can be obtained as follows, where db is the name of the database ("wiki" here, "imdb" for the IMDb data).
db = "wiki"
full_path = meta[db][0, 0]["full_path"][0]
dob = meta[db][0, 0]["dob"][0]
The age of the person in the image is not directly included in the metadata and needs to be calculated. In principle, the date of birth (dob) is subtracted from the year the photo was taken (photo_taken), but the following details need care.
First, dob is stored as a Matlab serial date number and cannot be used until it is converted to a date. Python can handle a similar format, but there is a trap (http://sociograph.blogspot.jp/2011/04/how-to-avoid-gotcha-when-converting.html): Matlab counts days from January 1 of year 0, whereas Python counts days from January 1 of year 1. The conversion therefore has to subtract the length of year 0, which is 366 days because year 0 is a leap year.
from datetime import datetime
python_datetime = datetime.fromordinal(int(matlab_serial_date_number) - 366)
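As a quick sanity check of the 366-day offset, the conversion can be sketched as a small function. The helper name matlab_datenum_to_datetime and the handling of the fractional (time-of-day) part are assumptions made for illustration; they are not part of the original code.

```python
from datetime import datetime, timedelta


def matlab_datenum_to_datetime(datenum):
    """Convert a Matlab serial date number to a Python datetime.

    Hypothetical helper for illustration only.
    """
    days = int(datenum)        # whole days since Matlab's day 0
    fraction = datenum - days  # time of day as a fraction of a day
    # Python ordinals start at year 1, Matlab's at year 0 (a leap year),
    # so the two scales differ by 366 days.
    return datetime.fromordinal(days - 366) + timedelta(days=fraction)


# Matlab's datenum(2000, 1, 1) is 730486.
print(matlab_datenum_to_datetime(730486))  # 2000-01-01 00:00:00
```

Forgetting the offset would shift every birth date by about one year, which is easy to miss because the resulting dates still look plausible.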
Also, photo_taken contains only the year. Taking the middle of the year as the expected shooting date, the age is calculated as shown below.
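from datetime import datetime

def calc_age(taken, dob):
    # dob is a Matlab serial date number; clamp the ordinal to 1 so that
    # values too small to convert do not raise an error
    birth = datetime.fromordinal(max(int(dob) - 366, 1))

    # assume the photo was taken in the middle (Jul. 1) of the year
    if birth.month < 7:
        return taken - birth.year
    else:
        return taken - birth.year - 1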
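To make the mid-year assumption concrete, here is a small worked example that repeats calc_age and applies it to made-up birth dates. The helper to_matlab_datenum and all input values are hypothetical and only serve to build test inputs.

```python
from datetime import date, datetime


def calc_age(taken, dob):
    # Same logic as the function above, repeated so this sketch runs alone.
    birth = datetime.fromordinal(max(int(dob) - 366, 1))
    if birth.month < 7:
        return taken - birth.year
    return taken - birth.year - 1


def to_matlab_datenum(d):
    # Hypothetical helper: inverse of the 366-day conversion.
    return d.toordinal() + 366


spring = to_matlab_datenum(date(1980, 3, 15))  # birthday before Jul. 1
autumn = to_matlab_datenum(date(1980, 9, 15))  # birthday after Jul. 1

print(calc_age(2010, spring))  # 30
print(calc_age(2010, autumn))  # 29
print(calc_age(2010, 0))       # 2009 (dob clamped to year 1)
```

The last call shows why the ordinal is clamped: a dob that cannot be converted does not crash the script, but it produces an implausible age that has to be filtered out afterwards.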
Since the data was basically obtained by crawling, it contains a lot of noise that needs to be removed. For example, the Wikipedia database includes the following.
-- There are 18016 cases where face_score is -inf.
-- There are 4096 cases where second_face_score is not nan. If there is more than one face, the metadata may belong to the second face.
-- The calculated age is negative or greater than 100.
Therefore, it is necessary to extract only the samples whose face_score is above a certain threshold, whose second_face_score is nan, and whose age is in the range of 0 to 100.
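The three conditions above could be combined into a single predicate, roughly as in the following sketch. The function name is_clean_sample, the default threshold min_score=1.0, and the sample values are all assumptions for illustration; the article only says that face_score should be "above a certain level".

```python
import math


def is_clean_sample(face_score, second_face_score, age, min_score=1.0):
    """Return True if a sample passes all three noise filters.

    Hypothetical helper; min_score is an assumed threshold.
    """
    # Rejects low scores, -inf, and nan in one comparison.
    if not face_score >= min_score:
        return False
    # A non-nan second score means another face was detected.
    if not math.isnan(second_face_score):
        return False
    # Drop implausible ages caused by bad dob or photo_taken values.
    return 0 <= age <= 100


nan = float("nan")
samples = [
    (3.2, nan, 35),            # clean
    (float("-inf"), nan, 40),  # face_score is -inf
    (2.5, 1.8, 28),            # second face present
    (4.1, nan, 2009),          # implausible age
    (0.4, nan, 50),            # face_score below threshold
]
kept = [s for s in samples if is_clean_sample(*s)]
print(len(kept))  # 1
```

On the real metadata the same checks would typically be applied to the arrays read from the .mat file, and the threshold should be chosen after looking at the face_score distribution.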
Finally, let's take a look at the contents of the Wikipedia dataset. Details are below. https://github.com/yu4u/age-gender-estimation/blob/master/check_dataset.ipynb
Distribution of face_score (excluding -inf).
Distribution of second_face_score (excluding nan).
An example of an image with face_score greater than 5
An example of an image with face_score from 0 to 1
An example of an image where face_score is -inf
Examples of images with calculated age greater than 100
An example of an image with a negative calculated age