The data set I used for this project is the official collection data published on the M+ GitHub page. It includes basic metadata for each artwork (accession number, title, date, classification, medium, and dimensions) for both the M+ Permanent Collection and the M+ Sigg Collection (mostly donated by the Swiss collector Uli Sigg in 2012). The metadata comes as three separate CSV files: artwork records of the M+ Permanent Collection, artwork records of the M+ Sigg Collection, and maker information for all the collected works. Images of the artworks are not provided due to copyright policy.
To clean up the data for analysis, the two artwork record files were joined and all archival records were deleted, since they are not relevant to this analysis. The original data assigns each artwork a maximum of three categories and a maximum of two areas, stored as arrays. For ease of analysis, only the primary category and area were kept, and the second and third entries of each array were dropped. The maker information (name, year of birth, nationality, and gender) was joined to the artwork data frame using the constituent code as the key, so that these variables could be used as attributes in the analysis.
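The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the original notebook: every column name (`object_type`, `category`, `constituent_id`, and so on) is hypothetical, and the sketch assumes the category arrays arrive as comma-separated strings, which may not match the actual CSV encoding.

```python
import pandas as pd

# Toy stand-ins for the three CSV files. All rows and column names are
# invented for illustration; the real M+ files may differ.
permanent = pd.DataFrame({
    "accession_number": ["P-1", "P-2", "P-3"],
    "title": ["Work A", "Work B", "Archive box"],
    "object_type": ["artwork", "artwork", "archive"],
    "category": ["Painting, Installation", "Design", "Archival Material"],
    "constituent_id": [1, 2, 1],
})
sigg = pd.DataFrame({
    "accession_number": ["S-1"],
    "title": ["Work C"],
    "object_type": ["artwork"],
    "category": ["Photography, Painting, Print"],
    "constituent_id": [3],
})
makers = pd.DataFrame({
    "constituent_id": [1, 2, 3],
    "name": ["Maker A", "Maker B", "Studio C"],
    "birth_year": [1950, 1962, None],
    "nationality": ["X", "Y", "Z"],
    "gender": ["male", "female", None],  # None stands in for groups/studios
})

# 1. Stack the two artwork files into a single frame.
artworks = pd.concat([permanent, sigg], ignore_index=True)

# 2. Drop archival records (assumes a column that flags them).
artworks = artworks[artworks["object_type"] != "archive"].copy()

# 3. Keep only the primary (first) category of each record.
artworks["category"] = artworks["category"].str.split(",").str[0].str.strip()

# 4. Attach the maker attributes via the shared constituent key.
merged = artworks.merge(makers, on="constituent_id", how="left")
```

A left merge keeps every artwork even when no maker record matches, which leaves the maker columns empty for that row rather than silently dropping it.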
One thing worth noting is that either M+ still uses a binary gender system when labeling artists, or all the artists of the collected works self-identified as men or women. As a result, the data set has only three gender labels (male, female, and NA for groups/studios).
The finalized data frame contains 7,373 records of collected artworks.
The analysis and visualization were performed mainly with Python’s pandas and Plotly libraries, two widely used tools for data analysis and visualization.
The pandas library was used mostly for data cleaning, since the metadata is partially unstructured. To shape the data into the attributes needed for this analysis, many columns were filtered out and several new columns were added.
The Plotly library was the main tool for creating the pie chart, choropleth map, and histograms that visually represent the collection data. The raceplotly library was used to create a bar race plot showing the count of each gender by the artists' birth year.
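As a rough illustration of the aggregation behind a gender-by-birth-year bar race, the sketch below builds the counts with pandas only. The column names `birth_year` and `gender` and the sample rows are hypothetical. Whether the plot used per-year counts or running totals is not stated here, so the sketch computes both, and the plotting call itself is left out.

```python
import pandas as pd

# Made-up rows; the column names are assumptions, not the real schema.
df = pd.DataFrame({
    "birth_year": [1950, 1950, 1960, 1960, 1960, 1970],
    "gender": ["male", "female", "male", "male", "female", "female"],
})

# Count artists per (birth year, gender). Unstacking with fill_value=0
# gives every year a value for every gender, even when none occur.
counts = (
    df.groupby(["birth_year", "gender"])
    .size()
    .unstack(fill_value=0)
    .sort_index()
)

# Running total per gender across birth years (an assumption about what
# the bars in a race plot would represent).
cumulative = counts.cumsum()

# Long format: one row per (birth year, gender) pair, a common input
# shape for animated bar charts.
race_input = cumulative.reset_index().melt(
    id_vars="birth_year", var_name="gender", value_name="count"
)
```

Rows with a missing gender (groups and studios) are dropped by `groupby` by default; passing `dropna=False` would keep them as their own bar.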