Mind the Gap

Dealing with missing values

Why data series sometimes contain gaps, or are missing altogether, and how to deal with them

Why are there gaps?

It is said that the phrase ‘mind the gap’ originated in 1968 when it was introduced as a warning to passengers on the London Underground. A number of novels, films and marketing manuals have since borrowed the idea. The gaps in question have ranged from the physical to the philosophical and social. This piece talks about gaps in market data, which lead to imprecision and uncertainty for dairy operators. There are several reasons why these gaps in data series exist:

  • The degree to which industries, sectors or whole countries record data varies enormously. Figures for production, trade and consumption are sometimes hard to find, especially for some of the biggest markets for imported dairy products, such as China and North Africa.
  • Not all data are available online in a convenient form. Some may be embedded in reports, tables or articles, or restricted to hard copies and printed documents.
  • Production and pricing information is often considered commercially sensitive, particularly in sectors such as ingredients, where there may be a small number of large operators.
  • Sometimes the data exist but potential users don’t know where to look.

How do we fill them?

At DatumLocus, we try to ensure that our data collection is as complete and as automated as possible. Sometimes, however, we supplement this with manual tools and use creative methods to estimate missing values from the data we have. Examples include:

  • Although they may sound unpleasant, web crawling and scraping techniques are employed widely to obtain market information which would otherwise be difficult to access.
  • We convert all of our raw data into milk equivalent and/or the main constituent parts of milk (fat, protein, lactose). We use these ‘processed’ data in many elements of our analysis. In the case of missing data, this allows us to be more precise when identifying and filling gaps.
  • In instances where it is unrealistic to expect figures to be available on a weekly or monthly basis (e.g. consumption volumes split by retail, foodservice and industrial), we conduct market mapping, which gives a snapshot of a particular market based on the data available.
  • We frequently find ourselves ‘thinking outside the box’ to come up with new solutions.
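The conversion to milk equivalent and constituent parts mentioned above can be illustrated with a minimal sketch. The composition figures, product names and function names below are hypothetical and chosen only for illustration; they are not DatumLocus conversion factors. The fat-plus-protein basis used for milk equivalent is one convention among several and is an assumption here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Composition:
    """Share of each milk constituent per tonne of product (0 to 1)."""
    fat: float
    protein: float
    lactose: float


# Hypothetical, rounded composition figures for illustration only.
COMPOSITION = {
    "raw_milk": Composition(fat=0.040, protein=0.034, lactose=0.048),
    "butter": Composition(fat=0.820, protein=0.006, lactose=0.006),
    "smp": Composition(fat=0.010, protein=0.350, lactose=0.520),
}


def to_constituents(product: str, tonnes: float) -> dict[str, float]:
    """Split a product volume into tonnes of fat, protein and lactose."""
    c = COMPOSITION[product]
    return {
        "fat": tonnes * c.fat,
        "protein": tonnes * c.protein,
        "lactose": tonnes * c.lactose,
    }


def milk_equivalent(product: str, tonnes: float) -> float:
    """Tonnes of raw milk holding the same fat plus protein as the product.

    Assumption: a fat-plus-protein basis; other bases would give
    different results.
    """
    c = COMPOSITION[product]
    milk = COMPOSITION["raw_milk"]
    return tonnes * (c.fat + c.protein) / (milk.fat + milk.protein)
```

Once every series is expressed in the same units, a gap in one product can be checked against the fat and protein known to be available in the others, which is why working in constituents can make gap-filling more precise.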

The case of Irish cheese

An example of missing data is monthly production of cheese in Ireland. Since January 2021, monthly figures have not become available until well after the end of the year to which they relate (e.g. 2021 values appeared in 2022).

Monthly data are available for domestic milk deliveries and for production of drinking milk, cream, butter and skimmed milk powder (SMP), the last of these with gaps for some months. To make things more difficult, volumes of imported milk for processing, which can account for as much as 10% of total deliveries to dairies, are unavailable, as are monthly data for whole milk powder (WMP), fat-filled milk powder (FFMP) and yogurt production.

In spite of this, we have been able to derive values for monthly cheese production from the data available using a variety of techniques. Below are the actual and derived figures for monthly Irish cheese production in 2021. Our cumulative accuracy rate for the year was 98%.

[Chart: Irish Cheese Production, actual and derived monthly figures, 2021]
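The techniques behind the derived figures are not described here, so the following is only a sketch of one plausible approach: a residual fat balance, in which the milk fat not accounted for by products with known volumes is attributed to cheese. The function names, the allowance for unobserved products, the cheese fat content and the definition of cumulative accuracy (one minus the relative error of the annual totals) are all assumptions made for illustration.

```python
def derive_cheese_tonnes(
    domestic_fat: float,
    known_fat_use: dict[str, float],
    import_share: float = 0.0,
    other_share: float = 0.0,
    cheese_fat: float = 0.32,
) -> float:
    """Estimate cheese output for one month from a residual fat balance.

    domestic_fat:  tonnes of fat in domestic milk deliveries
    known_fat_use: tonnes of fat in products with published volumes
    import_share:  assumed share of imported milk in total deliveries
    other_share:   assumed share of total fat going to unobserved
                   products (e.g. WMP, FFMP, yogurt)
    cheese_fat:    assumed fat content of cheese (hypothetical value)
    """
    # Imports are a share of total deliveries, so scale domestic volumes up.
    total_fat = domestic_fat / (1.0 - import_share)
    residual = total_fat - sum(known_fat_use.values()) - other_share * total_fat
    # A negative residual would mean the assumptions are inconsistent.
    return max(residual, 0.0) / cheese_fat


def cumulative_accuracy(actual: list[float], derived: list[float]) -> float:
    """One minus the relative error of the annual total (assumed definition)."""
    total_actual = sum(actual)
    return 1.0 - abs(sum(derived) - total_actual) / total_actual
```

Under this definition, monthly over- and under-estimates offset each other, so the cumulative figure can look better than the accuracy of any individual month.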

It’s not rocket science …

… it’s data science. Companies are using it more and more. Some of it is not especially complicated – Irish cheese is one such example – but some of it is. DatumLocus can help fill in the gaps and provide complete datasets, all in one place, available at the click of a mouse or the touch of a screen.