More than five lessons

Enter machine learning and the capacity to collect large amounts of data, and all of a sudden the computer science and cultural heritage communities seem to be arriving at the same point. Is what we are doing, and how we are doing it, ok?

Well, no, not really, in my view, but there are a few small lights ahead.

The article by Eun Seo Jo and Timnit Gebru “Lessons from archives: strategies for collecting sociocultural data in machine learning” is a good start in acknowledging that this is the right time to bring different disciplinary viewpoints and practices together to understand how to appropriately apply machine learning to large document collections, especially documentary history collections in cultural heritage. The authors coin the phrase “sociocultural machine learning” and outline five lessons, drawn from archival practice, for the machine learning community to integrate into its own practice: consent, power, inclusivity, transparency, and ethics and privacy.

There is much to be gained from interdisciplinary knowledge exchange. In dialogue on this article, a colleague (Alexis Tindall) reflected that the case for advancing professional ethics as part of machine learning practice is well made. She asked a couple of good practical questions: Why not suggest archivists fill roles on machine learning teams? Why couldn’t archives be the home of datasets? Yeah, why not!?

The questions I seek to ask, and aim to help answer, in support of data curation as part of the AI4LAM community are (roughly):

  • Going deeper into epistemology, theory and methodology (how and where can and do the very different viewpoints and practices coalesce) 
  • Getting known assumptions and biases out and on the table (at least they are the known knowns!)   
  • Identifying models where complex and collective tensions are navigated (e.g. the five safes framework and the CARE principles for data governance)  
  • Identifying new approaches to tackling work together, e.g. radical collaboration (see the work of Nina Simon and Nancy McGovern)
  • Identifying areas of shared interest, value and community partners to work on that change process with (e.g. logic investment maps for guiding collaboration)  

Curatorial determinations about what to collect and how to collect are getting some recognition as part of the maturing of data science. Great! The curatorial community definitely have experience in tackling complex sociocultural issues. But these social practices can also perpetuate existing privilege and prejudice, and these need to be resolved, not compounded, with the introduction of machine learning.

After having looked at the accumulations of data and the computer science literature associated with the TRECVID conferences (information retrieval with video collections), I struggled with the following (just for starters):

  • Decontextualisation and exceptionalism
  • Data collection for experimentation versus representativeness 
  • Lack of documentation and clarity on scope
  • Cartesian and positivist ways of thinking  
  • Impacts of generalisation and homogeneity  

“Mere accumulation” is a way I describe indiscriminate collection, when materials get brought together by accident, convenience or even means-end type missions.  This is the view I bring to the data collection practices being employed in service of machine learning – as they stand presently.  It’s a brutal assessment, and it is no doubt in need of challenging. So it would be great to learn of exemplary practices to have my eyes and mind opened.  

As a starting point though, I welcome the approach outlined by the authors of this article, particularly where the intent is to make cultural heritage more accessible.  It seems reasonable to assume that there will be more than five lessons that come out of stronger interdisciplinary partnerships.     

*Thanks go to Nicole Coleman, who asked me for my thoughts on this article. Definitely worth having a read of her article: Managing bias when library collections become data.