The Metropolitan Museum of Art in New York City

End-to-End Image Recognition With Open Source Data — Part 1: Data Acquisition & Model Training

data4help
May 5, 2021

Exploring the Metropolitan Museum of Art’s treasure trove of art images, and predicting where a painting comes from.

Due to the Covid-19 pandemic, many museums around the world have shut their doors, with the Metropolitan Museum of Art in New York being no exception. “The Met”, as the museum is commonly called, presents an extensive collection spanning 5,000 years of art history across its two locations, on 5th Avenue in the Upper East Side and at the Met Cloisters. Its collection is much larger than the set of pieces displayed to the public each year, comprising hundreds of thousands of precious artefacts.

For those interested in exploring more of the museum’s collection, or who are missing the Met during the pandemic, the museum makes an extensive set of data available through its open API.

A painting contained in the dataset, seen inside the Met. Photo taken by the author.

About the API

The Met’s Art Collection API gives access to a treasure trove of data about the museum’s collection. For a given piece in the collection, the API returns data about the artwork, such as its category, its title, the region it comes from, the culture it belongs to, and even the year or timeframe in which the piece was made.

Perhaps even more exciting is that many of the pieces also contain an image of the piece, stored as a link as part of the metadata of each object.

About the project

The Met’s easy-to-use API represents a great opportunity to try out image recognition techniques using open-source data.

In this series (part 2 here), we will walk through how to use open-source data like that obtained via the Met’s API, and how to prepare the images for automated image recognition using a convolutional neural network. We will then discuss training this model to predict which culture a painting comes from.

Finally, in the next post in the series, we will show how to deploy the painting culture prediction model as an interactive dashboard, where users can try out the classifier for themselves.

The goal is to show just how easy it is to create a working machine learning project for image recognition using open-source data that is freely available on the web, and free and open-source deployment tools.

Accessing the Data with the API

The first step in this project is, of course, to access the data that’s available via the Met’s API.

The API is very simple and easy to use: it doesn’t require any registration or access token. All of the objects in the Met’s collection are organized via a key called objectID. The first API endpoint, “Objects”, simply returns a list of all valid objectIDs. Using this list, we can then call the second endpoint, “Object”, which takes an individual objectID and returns the data about that object.

Finally, to get the images for the objects, we make a request call to the link saved under the data field called “primaryImageSmall” in each object’s data. If a link exists, we can use it to download the .jpg file it links to.

This process of accessing the data is shown in the code below:
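A minimal sketch of these steps might look like the following (the images/ output folder and the 10-second timeouts are our own assumptions, not part of the API):

```python
import os
import requests

BASE_URL = "https://collectionapi.metmuseum.org/public/collection/v1"

# The "Objects" endpoint returns every valid objectID in the collection.
object_ids = requests.get(f"{BASE_URL}/objects").json()["objectIDs"]

records = []  # metadata for every object we manage to download
os.makedirs("images", exist_ok=True)

for object_id in object_ids:
    # The "Object" endpoint returns the metadata for a single object.
    try:
        metadata = requests.get(f"{BASE_URL}/objects/{object_id}", timeout=10).json()
    except (requests.RequestException, ValueError):
        continue  # skip objects whose request or JSON parsing fails

    # Not every object has an image; skip the ones without a link.
    image_url = metadata.get("primaryImageSmall")
    if not image_url:
        continue

    # Some image links are dead, so download defensively.
    try:
        response = requests.get(image_url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        continue

    with open(os.path.join("images", f"{object_id}.jpg"), "wb") as f:
        f.write(response.content)
    records.append(metadata)
```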

As with web scraping, accessing an API can sometimes involve handling errors. In the data-access code, several issues had to be dealt with, mostly with conditional statements and exceptions. For example, not all items have an image associated with them. Additionally, not all URL links to the images worked. Dealing with these issues in a new dataset always involves some trial and error.

Exploring & Analyzing the Data

Now that we’ve downloaded all of the Met’s data from their free API, it’s time to take a look at what we have.

In total, we acquired data for about 130,000 objects with images. In addition to the object images, the data also contains 57 columns of metadata about the artwork. The table below shows the data found in these columns for a given example artwork, a print from Japan:

Example artwork, “Four Friends of Calligraphy: Lady Komachi”
Metadata values for the example artwork

We can see that for a given artwork, only some of the metadata columns contain data.

However, looking through the metadata columns, we see the many options for potential target variables for future machine learning models we could train using this data. For example, we could build a classifier for the type of object, or for its country of origin/culture, or its time period or epoch.

Before deciding on our target variable, we start by directly visualizing the images to get a feel for the mix of artwork included in the collection. This will also help us gauge the relative quantity and quality of data available for predicting a given target variable. For example, in the metadata for the example image, we saw that the fields for country and region were blank. If this is also true for many other pieces in the collection, those fields may not be the best choice of target variable.

Our first visualization is a grid plot of a random sampling of the image data, shown below.

Sample images from the Met collection

As we can see, the collection contains many different types of artwork, from paintings and drawings to ceramics, fashion, and furniture. Furthermore, most of the images are in color, but some images are black and white. The code for creating the grid of images is shown below.
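A minimal sketch of such a grid plot, assuming the images live in the images/ folder from the download step:

```python
import random
from pathlib import Path

import cv2
import matplotlib.pyplot as plt

# Pick 16 downloaded images at random and show them in a 4x4 grid.
image_paths = random.sample(sorted(Path("images").glob("*.jpg")), 16)

fig, axes = plt.subplots(4, 4, figsize=(12, 12))
for ax, path in zip(axes.flatten(), image_paths):
    image = cv2.imread(str(path), cv2.IMREAD_COLOR)
    # OpenCV loads images in BGR order; convert to RGB for matplotlib.
    ax.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    ax.axis("off")
plt.tight_layout()
plt.show()
```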

Next, we explore the composition of our data. Our goal is to answer questions like: How many works of art do we have from each culture? How many paintings are in our dataset? How many drawings? How many sculptures? What medium is most commonly used?

To answer these questions, we’ll create some basic analysis plots. In the first plot, we look at the various types of artwork in the collection.
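A sketch of this first count plot, assuming the metadata records from the download step were collected into a pandas DataFrame and that the artwork type lives in the classification column:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Collect the metadata gathered during the download into a DataFrame.
df = pd.DataFrame(records)

# Count the objects per artwork type and plot the 20 most common.
type_counts = df["classification"].value_counts()
type_counts.head(20).plot(kind="barh", figsize=(8, 8))
plt.gca().invert_yaxis()  # largest category at the top
plt.xlabel("Number of objects")
plt.title("Most common artwork types in the collection")
plt.tight_layout()
plt.show()
```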

In this initial visualization, we can see that the dataset contains a good balance of many different types of artwork. We also see that some artwork types have multiple names or aliases, like “Metalwork”, “Metal”, and “Metal-Ornaments”. This will be something to keep in mind later when training models.

Given that there were over 450 different categories of artwork, we decided that this would be too many to try to classify, especially given that many of the categories did not have many samples. Instead, we chose to take just one artwork type, Paintings, and try to predict what country the painting is from. The country is actually held in the variable called “culture”.

The next visualization shows the number of paintings we have for each country.

We see that China and America account for the most paintings, with Japan in third place. There are also many paintings from various regions of India.

In the end, we decided to train a model to classify paintings from the 4 countries with the most samples: China, Japan, America and India. For India, we grouped together all paintings whose culture name included the word “India”.
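A sketch of this filtering and grouping step (the exact label strings in the classification and culture columns are assumptions):

```python
# Keep only the paintings from the full metadata DataFrame.
paintings = df[df["classification"] == "Paintings"].copy()

# Group every culture label that mentions "India" under one name.
paintings["culture"] = paintings["culture"].fillna("")
india_mask = paintings["culture"].str.contains("India", case=False)
paintings.loc[india_mask, "culture"] = "India"

# Restrict the data to the 4 best-represented cultures.
top_cultures = paintings["culture"].value_counts().head(4).index
paintings = paintings[paintings["culture"].isin(top_cultures)]
print(paintings["culture"].value_counts())
```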

Next, we wanted to see if we could recognize a difference between paintings from the various cultures with our “naive” human eyes. To do this, we created an image grid with paintings from each of the 4 cultures. These grids are shown below.

Example paintings from the “American” culture
Example paintings from the “Indian” culture
Example paintings from the “Chinese” culture
Example paintings from the “Japanese” culture

It’s fascinating to see the difference in styles across the 4 cultures in our training dataset. The colors used in the paintings from India, for example, are quite distinct, and they often contain a border around the painting. The American paintings tend to contain many portraits and landscapes. The Chinese and Japanese paintings tend to include more minimalistic nature scenes, for example showing one or two flowers or birds. Many paintings from both cultures also include a calligraphy element to one side of the painting. However, we agreed that it can be difficult for our untrained Western eyes to immediately recognize whether a painting is from Japan or China. It will be very interesting to see if our model is able to recognize the difference better than we can!

Now that we have visualized our data and decided on our target variable of painting culture of origin, it’s time to start preparing our data for modeling.

Preparing the Data for Modeling

In order to do image classification, we have to convert our training data images into numbers that the model can read and understand. To do this, we will use the open-source package OpenCV.

OpenCV contains tools and functions for virtually every pre-processing step needed to get our image data ready for modeling. Even better, most of these commands can be strung together in just one line of code. This can be seen in the function below.
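A minimal sketch of such a function, assuming lists of image paths and culture labels (the INTER_AREA interpolation choice is our assumption):

```python
import cv2
import numpy as np

# Target dimensions: every image is resized to a 150x150 pixel square.
IMG_ROWS, IMG_COLS = 150, 150

def build_arrays(image_paths, labels):
    """Read, resize, and collect the images and their labels."""
    x, y = [], []
    for path, label in zip(image_paths, labels):
        # Read the image in color and resize it, all in one line.
        resized = cv2.resize(
            cv2.imread(str(path), cv2.IMREAD_COLOR),
            (IMG_ROWS, IMG_COLS),
            interpolation=cv2.INTER_AREA,
        )
        x.append(resized)  # a (150, 150, 3) array
        y.append(label)
    return np.array(x), np.array(y)
```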

Let’s walk through what the various parts of this code are doing.

First, we define the new size of the images. The raw images come in all different sizes, but ML models require the dimensions of the input data to remain consistent. Here, we reshape all images to a 150x150 pixel square. This is done with the function cv2.resize(). Note that we pass these defined rows and columns as the second argument to that function. The first argument is the output of another OpenCV function, cv2.imread(), which reads in an image file. We use the flag cv2.IMREAD_COLOR to tell OpenCV that the image is a color image, not a black and white one. The final argument we pass to cv2.resize() is interpolation, which controls how the resizing is done. You can read more about the various options for this parameter here.

This string of OpenCV functions returns a 3-dimensional numpy array of shape (150, 150, 3). This means it contains 3 matrices, each 150x150, one per color channel (note that OpenCV orders the channels blue, green, red (BGR) by default, rather than RGB). We append each of these 3-D arrays to an array of our training data, called x. At the same time, we add the label of the data to an array of labels called y.

Now that we have the data resized and saved, we can apply another pre-processing step that augments our data by slightly altering the images: zooming, shifting, and so on. This is done with a class from Keras called ImageDataGenerator(), which applies these small random changes to the training images, generating additional data in the process. The best part about this class is that its output can be fed directly into our ML model as an input.

The code for applying the image data generator is shown below.
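A sketch of the two generators, assuming the arrays were already split into training and test sets (e.g. with scikit-learn’s train_test_split) and the labels one-hot encoded; the specific augmentation parameters are illustrative assumptions:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Training generator: rescale pixel values and add light augmentation.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    zoom_range=0.2,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
)

# Test generator: rescaling only, no augmentation of the holdout data.
test_datagen = ImageDataGenerator(rescale=1.0 / 255)

train_generator = train_datagen.flow(x_train, y_train, batch_size=32)
# batch_size=1 returns one image at a time, keeping predictions aligned
# with the labels.
test_generator = test_datagen.flow(x_test, y_test, batch_size=1, shuffle=False)
```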

Notice that we need 2 different generators: one for generating additional training data, and one for the holdout test data. The test data generator only rescales the image pixels and does not generate any additional data. Also note that in the test data generator, we set the batch size to 1, so that the generator returns one image at a time and the images stay aligned with our prediction labels.

Model Set-up and Training

Now that we have prepared our training and test data generators, it’s time to set up and train our model!

To build our model, we will use the popular deep learning framework Keras. Keras is built on top of TensorFlow and is meant to provide an easier API to work with than native TensorFlow.

After choosing our deep learning framework, the next step is to choose a model architecture to implement. In deep learning, model architecture refers to the number, shape, and kind of layers that combine to form the neural network. There are nearly endless combinations of architectures, and new papers proposing new ones come out almost daily. For this project, we’ve gone with a common architecture for image recognition, which can also be seen in this blog post, and is quite similar to this one as well, just omitting the dropout layers.
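A sketch of such an architecture (the optimizer and the exact layer counts and sizes are illustrative assumptions, not the precise values from the referenced posts):

```python
from tensorflow.keras import layers, models

# A common CNN pattern for image classification: stacked
# convolution + pooling blocks, then dense layers on top.
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(150, 150, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    # Four output nodes, one per culture, with softmax probabilities.
    layers.Dense(4, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```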

Note that the final layer of the network has an output size of 4: this is chosen because we have 4 possible categories to predict, our 4 cultures of Japan, America, China and India. Each output node represents the probability of the image belonging to one class. If this were a binary classification problem, like spam email classification, the final layer would typically have a single output node representing the probability that the email belongs to the positive class (is spam); the probability of the negative class (not spam) is simply one minus that value.

Similarly, we use the softmax activation function at the end, as this is the most common choice for multi-class classification problems. For a binary classification problem with a single output node, we would change this to sigmoid.

We chose categorical_crossentropy as the loss function since this is a multi-class classification problem. This great blog post gives more information about choosing the loss function for other types of deep learning problems.

Using this rather boilerplate architecture, we were ready to start training our model. We began with just 50 training epochs, but then increased this to 100 when it looked like the model was still learning after 50 epochs.
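A sketch of the training call and the accuracy plot (the history key names assume a recent Keras version):

```python
import matplotlib.pyplot as plt

# Train on the augmented generator, validating on the holdout generator.
history = model.fit(
    train_generator,
    epochs=100,
    validation_data=test_generator,
)

# Plot training vs. validation accuracy over the epochs.
plt.plot(history.history["accuracy"], label="training accuracy")
plt.plot(history.history["val_accuracy"], label="validation accuracy")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```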

The graph below shows how training and validation accuracy improved over the training period.

Accuracy over the 100 training epochs

We can see that by the end of training, the validation accuracy was hovering around 80%.

Wrapping it up

What have we accomplished so far? We analyzed the data downloaded using the Met’s API and decided on a prediction problem using the data. We decided to limit ourselves to the painting images only, and to predict which of 4 cultures each painting belongs to: China, Japan, India or America. We then pre-processed the images and fed them into a convolutional neural network with 4 output nodes, one for each category.

Now that we have a trained model, the next step is to make the model available to predict on new data. This process is called deployment.

In the next blog post, we’ll show how to build a simple front-end dashboard and deploy our trained machine learning model as a live backend application.


data4help

Helping non-profits and NGOs harness the power of their data. data4help.org