melbalabs journal

practical mysticism

31 Jul 2020

data cleanup misadventures

There's always a big initial push to bring a system up and have it create and accumulate data, but the decisions about how to manage, archive and dispose of that data are usually delayed and avoided. We can always figure it out later, right?

There are many reasons to delay deletes to some later stage, and they depend on your perspective:

  • product - it's not yet obvious how long the data will be needed. For example, discussions and integrations with a client are still ongoing and product requirements have yet to stabilize.
  • legal - even if product doesn't need the data, legislation requires you to keep it for audits and government reporting
  • analytics - different internal teams use the accumulated data to understand market movements, change direction of products and make new products
  • technical - the dev team might not care about deleting data, because it doesn't impact quality of service
  • technical - keeping data allows reprocessing it with newer versions of the system and serves as a backup in case of bugs
  • financial - keeping the data has no noticeable impact, because storing it doesn't cost much
  • financial - implementing data management is actually expensive, because you have to develop, test, deploy, monitor and maintain the extra features. You also lose flexibility: whenever the project changes direction or implementation, you have to pay that price again.
  • self-interest - apply the classic strategy of debt accumulation - it's not going to be my problem to fix it in the future and I don't care
  • self-interest - I'd have to talk to 20 different people to make sure nobody is affected by the destructive operation, and in the end I might still get blamed for destroying data
  • business - there are structural and process changes being implemented that will make all operations more efficient, e.g. by reducing the number of outages and fires to put out.
  • financial - customers will be onboarded and you'll be able to throw more money at the problem, so you won't have to cut as many corners

It's very hard or impossible to correctly weigh those choices against each other, and the back-of-the-envelope calculation usually leads to inaction: ignoring the problem and focusing on more immediate issues.

Solving the problem now has the advantage that you know your exact current state. Postponing the problem is a bet that the future will be better than the present. As time goes by, technical staff and management might lose track of the specifics of how the system operates - documentation is incomplete, outdated or nonexistent. A key person might leave. As more and more corners are cut, you'll have bigger and bigger fires, and you might never get to come back and fix the system before it fails. There's never time to do it right, but always time to do it over.