Thursday, October 1, 2015

A guide to the little jobs that will make big data work

Time to get your hands dirty.

Big data is one of the more glamorous terms in today’s IT vernacular, but in reality, making it work is about small, dirty jobs that consume a lot of resources for little immediate return.
As businesses adopt an obsessive focus on ‘the customer’, a clear divide is opening up between companies such as Amazon and Uber that run their businesses on analytics, and the silent majority simply trying to do things a little bit better.
The secret to staying on the right side of this gap is investing early in the little things, like metadata governance, standardisation and data glossaries.

Here are a few simple tips to guide the formative days of your business’s big data future.


Cleaning up the metadata

The big data holy grail for enterprises is a single customer view.
But the reality is that most organisations have somewhere between eight and ten different customer databases, each existing to support a different transactional system, and each customer probably has a different ID in each of those systems.
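To make that concrete, here is a minimal sketch in Python of a master customer record that cross-references those local IDs. The system names and ID formats are purely hypothetical.

from dataclasses import dataclass, field

@dataclass
class MasterCustomer:
    master_id: str
    # Map of source system name -> that system's local customer ID
    source_ids: dict = field(default_factory=dict)

# One customer, known by a different ID in each transactional system
customer = MasterCustomer(master_id="CUST-000123")
customer.source_ids["crm"] = "A-98217"
customer.source_ids["billing"] = "0005521"
customer.source_ids["web_store"] = "u-7743f9"

print(customer.master_id, customer.source_ids)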

This is only going to become more daunting as the internet of things becomes a day-to-day reality and sensors are used to detect a customer in-store, track their browsing behaviour and make them offers via their devices.

We have to remember that big data usually hasn’t been cleaned up and integrated into a single source of truth – indeed the opposite is the case.

To understand what is in the data lake, we need high quality metadata to track the various data stores and to distil some meaning from them.
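As a rough illustration, a metadata register can start out as something as simple as the sketch below, which records what each store holds, who owns it and how often it is refreshed. The store names and attributes are illustrative assumptions, not a prescribed schema.

# Each entry describes one store feeding the lake: what it holds,
# who owns it and how often it is refreshed.
stores = [
    {"name": "crm_customers", "owner": "sales_ops",
     "refresh": "nightly", "contains": "customer master records"},
    {"name": "web_clickstream", "owner": "digital",
     "refresh": "hourly", "contains": "raw browsing events"},
]

for store in stores:
    print("{name}: {contains} (owner: {owner}, refresh: {refresh})".format(**store))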


Metadata Tsars

This is where good quality governance becomes critical.
Most enterprises have some form of data governance, but its focus is usually restricted to higher-level priorities than metadata.

But you really can’t make any sense of the vast amounts of data unless you have a comprehensive metadata management approach.

This means taking on historical, headache-inducing problems such as inconsistent data types and data names, for example dates stored as variable character fields.
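For example, dates captured as free text can be normalised into a single standard format. The sketch below assumes a handful of known input formats; anything that doesn't match is flagged for review rather than guessed at.

from datetime import datetime

# Formats seen in the legacy fields; extend the list as new ones turn up.
KNOWN_FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%d %b %Y"]

def normalise_date(raw):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None  # flag for manual review rather than guessing

print(normalise_date("03/07/2015"))   # 2015-07-03
print(normalise_date("3 Jul 2015"))   # 2015-07-03
print(normalise_date("not a date"))   # None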

Correcting these is a significant exercise, often with little to show for it until a sizable investment has already been made. Unfortunately, these are jobs that can’t be avoided.


Data lakes not data dumps

There is no point in building a data lake if the information in it can’t be accessed. That is a data dump.
Enterprises have traditionally struggled to implement data warehouses.

At best, they have been a reasonable home for basic reporting. At worst, their shortcomings have led to a proliferation of these environments, and the truth is that most enterprises now run a number of data warehouses.

The result is an architecture landscape splintered into a number of separate data stacks.

We have learnt this lesson, so let’s not repeat the same mistakes with big data.


Think like a librarian

The right approach is to do what librarians do: establish a data glossary that catalogues the enterprise’s data sources.

This does not need to be all-encompassing; you do not need to boil the ocean. Instead, start with a common set of critical business data elements, then focus on enriching the catalogue so that sources are noted and the applications that use the data are tracked.
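A glossary entry can start out very simply. The sketch below is one hypothetical shape for it, recording a definition for each critical data element along with its sources and consuming applications.

from dataclasses import dataclass, field

@dataclass
class GlossaryEntry:
    name: str          # business name of the data element
    definition: str    # agreed business definition
    sources: list = field(default_factory=list)    # systems of record
    consumers: list = field(default_factory=list)  # applications that use it

glossary = {
    "customer_email": GlossaryEntry(
        name="customer_email",
        definition="Primary email address used to contact the customer.",
        sources=["crm"],
        consumers=["marketing_platform", "billing"],
    ),
}

print(glossary["customer_email"])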

Like a library, this enables sharing: anyone in the business can use their BI tool of choice to access a shared, validated data set.

For financial services, it is also critical to maintain data lineage, which in essence means that regulated data is never deleted. If we find an issue and want to correct it, we maintain a history of the change by appending rather than overwriting.
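As a simple illustration, corrections can be modelled as an append-only history, where the latest version gives the current value and the full list preserves the lineage. The record structure below is hypothetical.

from datetime import datetime, timezone

history = []  # every version of the record is kept, never deleted

def append_correction(record_id, field_name, new_value, reason):
    """Record a new version of a field; earlier versions remain in place."""
    history.append({
        "record_id": record_id,
        "field": field_name,
        "value": new_value,
        "reason": reason,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "version": len(history) + 1,
    })

append_correction("CUST-000123", "date_of_birth", "1980-04-02", "initial load")
append_correction("CUST-000123", "date_of_birth", "1980-04-20", "typo corrected")

print(history[-1]["value"])  # current value: 1980-04-20
print(len(history))          # 2 versions retained for lineage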


Elephants are afraid of mice

Doug Cutting’s son had a toy elephant named Hadoop, and that inspired the name of his influential big data framework. But most of us think of elephants as giant beasts roaming sub-Saharan Africa.

I’ve often heard that elephants are afraid of mice, and in the data world at least this seems to hold true.
A giant Hadoop cluster looks all-powerful, but without small things like high-quality metadata in place, the elephant is much weaker than you would expect.

When it comes to big data the small things really matter.
