Stop data pollution from turning your company’s data lake into a swamp

This article was written by Kevin Campbell, CEO of Syniti

Today, every organization is a data organization. Whether you work for a Silicon Valley tech company, an established manufacturer, a traditional financial services company, or even a government agency, your business is collecting, storing, and aiming to use more data than ever before.

We are in the midst of a global data explosion: the total volume of enterprise data is expected to double from 1,005 to 2,025 terabytes between 2020 and 2022. It’s no wonder that many organizations are perpetually playing catch-up, lacking the knowledge and tools to manage the data they collect so that it is truly useful.

To handle this deluge of data, many businesses are turning to data lakes, instead of a standard data warehouse. In theory, data lakes give businesses the edge in scalability, flexibility, and integration with technologies like IoT. However, rather than a pristine data lake, many organizations end up with something like a stagnant data swamp, full of murky data pollution. So what can you do to avoid the swamp and get the most out of your data?

1. Choose the most important company data… and get (almost) everyone to agree

I have seven children, and as a father I of course love them all equally. The same is not true of data. Stop treating all of your business data as if it were equally important. Believe me, it isn’t.

You need to decide, together with key stakeholders, which data is most important to your organization and its objectives. You can’t cover all of your data, and throwing everything into the data lake is the fastest way to create a swamp. So identify the data that drives the business and delivers broader business value – driving efficiencies, improving customer experiences, informing product development – and build your KPIs and success metrics around it.

Once you have these KPIs and your most important data, be sure to socialize them with key stakeholders so that you have their buy-in. Here are some questions to ask:

  • What are our KPIs?
  • What metrics are we going to measure?
  • Do we understand the formulas used to calculate them?
  • What rules do we need for how data is pulled into these metrics?
  • Which systems does our data reside in?

Consider creating a data charter that clearly states the above, so that everyone can refer to it and you can use it as the basis for your overall data strategy.
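One way to keep a charter honest is to write it down in a structured form that can be checked for gaps. The sketch below is only an illustration: the `KpiDefinition` fields, the example KPI names, and the `missing_answers` helper are all hypothetical, chosen to mirror the questions above rather than any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class KpiDefinition:
    """One entry in a hypothetical data charter (field names are assumptions)."""
    name: str
    formula: str                 # the agreed formula, written in plain language
    source_systems: list[str]    # systems where the underlying data resides
    inclusion_rules: list[str] = field(default_factory=list)  # rules for pulling data in


def missing_answers(kpi: KpiDefinition) -> list[str]:
    """Return the charter questions this KPI entry has not answered yet."""
    gaps = []
    if not kpi.formula.strip():
        gaps.append("formula")
    if not kpi.source_systems:
        gaps.append("source_systems")
    if not kpi.inclusion_rules:
        gaps.append("inclusion_rules")
    return gaps


# Illustrative entries only; real KPIs come from your stakeholders.
charter = [
    KpiDefinition(
        name="customer_retention_rate",
        formula="customers_retained / customers_at_period_start",
        source_systems=["crm"],
        inclusion_rules=["exclude test accounts"],
    ),
    KpiDefinition(name="on_time_delivery", formula="", source_systems=[]),
]

gaps_by_kpi = {kpi.name: missing_answers(kpi) for kpi in charter}
for name, gaps in gaps_by_kpi.items():
    print(name, "->", gaps or "complete")
```

An entry with open gaps is a prompt for another conversation with stakeholders, not something for the data team to fill in on its own.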

2. Know your data

So, you’ve selected your most important, business-critical data, and you’ve gotten key people in your organization to agree on it. What next? To paraphrase a wise Greek philosopher, you need to know your data: How is it created? Where did it come from? How is it maintained?

Take stock of where your important business data came from, and how and where it entered your systems. From there, make sure the data you store is accurate; effective, regular cleansing will remove or correct data that is incorrect, incomplete, irrelevant or improperly formatted. Be sure to include processes to eliminate duplicates and merge data sets. Deduplication might not be the sexiest thing in data, but it’s one of the most important – and done right, it can save you a ton of money and resources.
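As a rough illustration of what cleansing and deduplication involve, here is a minimal sketch using only the Python standard library. It assumes customer records arrive as dictionaries with `name`, `email` and `city` fields and that a normalized email is a good enough matching key; both are assumptions made for this example, and real matching rules are usually more involved.

```python
def normalize(record: dict) -> dict:
    """Collapse stray whitespace and lowercase the email so equal values compare equal."""
    return {
        "name": " ".join(record.get("name", "").split()),
        "email": record.get("email", "").strip().lower(),
        "city": record.get("city", "").strip(),
    }


def clean_and_deduplicate(records: list[dict]) -> list[dict]:
    """Drop records with no email, then merge records that share an email.

    When duplicates are merged, the first non-empty value seen for each
    field is kept.
    """
    merged: dict[str, dict] = {}
    for raw in records:
        rec = normalize(raw)
        if not rec["email"]:
            continue  # incomplete record: nothing to match on
        existing = merged.setdefault(rec["email"], rec)
        for key, value in rec.items():
            if not existing[key] and value:
                existing[key] = value  # fill a gap from the duplicate
    return list(merged.values())


# Hypothetical sample data for illustration.
raw_records = [
    {"name": "Ada  Lovelace", "email": "Ada@Example.com ", "city": ""},
    {"name": "Ada Lovelace", "email": "ada@example.com", "city": "London"},
    {"name": "No Email", "email": "", "city": "Paris"},
]
cleaned = clean_and_deduplicate(raw_records)
print(cleaned)
```

Here the two Ada Lovelace rows collapse into a single record that keeps the city from the second one, and the row with no email is dropped as incomplete.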

Given the variety of databases, file formats, and structures involved, this will take some time and work, but don’t skip this step. It’s crucial for breaking down internal silos and creating truly valuable data. Proper maintenance and entry-point controls that prevent duplicate records and bad addresses are non-negotiable. Without them, your lake will turn back into a swamp before you know it. Organizations make this mistake far too often.

3. Governance is essential for business data

I know. Governance is often seen as controlling, slow and restrictive. But in reality, it helps assign authority and control over data assets, so that the data is consistent and can be used across the organization.

For many businesses, customer success is one of the most essential KPIs. Truly understanding the entire customer lifecycle means going back to the first marketing contact. Who creates and establishes that customer record?

Without good governance, we could end up with multiple records for the same customer, which dilutes the information we have, prevents us from making smart data-driven decisions, and potentially hurts our ability to deliver a great customer experience.
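To make the problem concrete, the sketch below flags customers who appear under more than one identifier across systems. The row layout `(system, customer_id, email)`, the sample values, and the choice of email as the matching key are assumptions for illustration only.

```python
from collections import defaultdict


def find_conflicting_customers(rows):
    """Group (system, customer_id) pairs by normalized email.

    Returns only the emails that map to more than one identifier, which
    are candidates for review by whoever owns the customer record.
    """
    ids_by_email = defaultdict(set)
    for system, customer_id, email in rows:
        ids_by_email[email.strip().lower()].add((system, customer_id))
    return {
        email: sorted(ids)
        for email, ids in ids_by_email.items()
        if len(ids) > 1
    }


# Hypothetical rows pulled from two systems.
rows = [
    ("marketing", "M-001", "pat@example.com"),
    ("billing", "B-17", "Pat@example.com"),
    ("billing", "B-18", "lee@example.com"),
]
conflicts = find_conflicting_customers(rows)
print(conflicts)
```

A report like this only surfaces the conflict. Deciding which record is authoritative is the governance question, and it needs a named owner.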

Good governance should also promote compliance with any regulations that affect your organization, whether it’s HIPAA, GDPR, CCPA, POPI, LGPD or beyond.

The data charter mentioned earlier can serve as a cornerstone of your governance strategy. As a data program continues, it’s easy to lose sight of your original goals. Refer back to them regularly so that they remain a priority for all stakeholders. Likewise, it’s important not to be too rigid: if your organization’s requirements change, adjust your data charter accordingly.

Last but not least, transparency is crucial. Internally, this means clear communication between all stakeholders, so that different departments can share their knowledge and are held accountable for maintaining data quality.

Externally, it is imperative to be completely transparent about the customer and prospect data that your company collects. The most obvious reason is to avoid falling foul of regulators: Google, WhatsApp, and CaixaBank have all received multimillion-euro fines for violating the GDPR’s transparency provisions. It’s not worth it.

The more data the better? Not necessarily

More data isn’t always better. Businesses need to be careful when collecting and storing data for which they have limited tangible use. Not only does this present security, privacy, and compliance risks, but storing and managing this data is an unnecessary expense as well. Instead, focus on the data that is valuable and useful – you probably already have more than enough!

Clean, usable, and valuable data has the potential to drive new business growth, streamline operations, improve customer relationships, and enhance agility. Who wouldn’t want that?

For more than three decades, Kevin Campbell has passionately led innovation and growth in global Fortune 500 organizations and start-ups. He is currently CEO of Syniti.

DataDecisionMakers

Welcome to the VentureBeat community!

DataDecisionMakers is the place where experts, including data technicians, can share data-related ideas and innovations.
