Cleaning data files, post-London, pre-Pretoria

We are now in South Africa and have just started work on completing our research on the Forbes collection, one of the family collections with a very long timespan. What we are currently recording into our databases is ‘dirty’ in the sense that the information is being recorded under pressure in the archive, and then goes through a number of levels of cleaning after the archival work is done.

Cleaning? It would seem that domestic labour is an aspect of almost all activity to do with archival research! Since our intensive period of fieldwork at SOAS on the LMS Collection, the three weeks between returning and then getting ready for South Africa were filled with cleaning database files, a job carried out mainly by me, backed by Emilia, while Sue is not involved in this aspect of the project. So what does this ‘cleaning’ mean and what is involved, readers may wonder? There are lots of individual things that need to be done, but the three main aspects are outlined below.

The first is to ensure meta-data consistency. In the database records, no matter what the complexities of the actual documents being recorded, each of them has a number of items of meta-data which need to be specified, such as names, addresses and of course dates. It is important that these are always entered in the same completely consistent way, and also that they are entered correctly, because computer systems will choke on anything else when they deal with database information. This means among other things checking any dodgy or incomplete dates and names against JPEGs if digital photographs have been taken. Sounds easy and quite straightforward? Try doing it over 1000 times, trying to remember exactly how names need to be recorded, whether a particular place is spelt one way or another, and figuring out which set of JPEGs relates to which parts of which file!
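For readers who like to see such things spelled out, the kind of consistency check involved can be sketched in a few lines of Python. This is only an illustration: the field names, the sample records and the "day month year" date convention are all assumptions made up for the example, not the project's actual database or rules.

```python
from datetime import datetime

# Hypothetical records: the field names and the "day month year" date
# convention are assumptions for illustration, not the project's schema.
records = [
    {"id": 1, "writer": "Smith, Jane", "place": "Cape Town", "date": "3 May 1871"},
    {"id": 2, "writer": "smith,  jane", "place": "Cape Town", "date": "May 1871"},
]


def normalise_name(name):
    """Collapse repeated spaces and apply consistent capitalisation."""
    return " ".join(name.split()).title()


def check_date(text, fmt="%d %B %Y"):
    """Return True if the date is complete and matches the assumed format."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        return False


def flag_for_checking(records):
    """Return the ids of records whose date needs checking against the JPEGs."""
    return [r["id"] for r in records if not check_date(r["date"])]
```

Here the second record would be flagged because its date has no day, which is exactly the sort of entry that then has to be checked by eye against the photographs. Note that month names are parsed according to the computer's locale settings, so this sketch assumes an English-language setup.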

The second key aspect concerns what is recorded about the contents of the letters or other documents. There must be no mistakes in extracts or quotations taken from letters, and summaries of the contents have to be written so that they both provide information and are understandable to third parties, to later readers of them, so nothing elliptical or reliant on information in another record can appear.

The third aspect is that what I refer to as the ‘naughty words’ in race terms – words which might have been innocent when first used, but which over time have come to have considerable racial connotations or racist meanings in South Africa – have to be copied from the summary or extracts and quotations and then placed in a special part of the database. This will eventually enable tracking usage of these over time across all of the letters that will appear in the published Whites Writing Whiteness website.
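As a rough illustration of this third step, the copying of flagged words into their own field could be sketched as below. The list of terms uses neutral placeholders, and the field names `summary` and `flagged_terms` are hypothetical; the real database and its term list are not reproduced here.

```python
import re

# Placeholder terms only: the actual list tracked by the project is not shown.
FLAGGED_TERMS = ["exampleterm", "otherterm"]


def extract_flagged(text, terms=FLAGGED_TERMS):
    """Return the flagged terms that occur as whole words in the text."""
    found = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


def tag_record(record):
    """Copy flagged terms from the summary into a separate field."""
    record["flagged_terms"] = extract_flagged(record["summary"])
    return record
```

Keeping the terms in a field of their own, alongside each record's date, is what would later make it possible to count usage year by year.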

But here we are now, having left cleaning behind for a six-week period in which new components of the WWW database are being recorded and presently look a bit messy and rather inelegant. But they will scrub up nicely at a later stage. For those of a romantic disposition, working in archives will always have the edge. But without that domestic labour of archival work called file cleaning, there would be no large-scale body of data to underpin WWW analysis concerning the unfolding of racial forms of representation, or to support secondary analysis by many other researchers as well!

The next blog will take the form of comments on our unfolding work in the National Archives in Pretoria, which is looking beautiful in late spring.

Last updated: 10 November 2016

