How do we define big data?
OK, while I fully expect every individual or company to add its own tweaks here and there, here is a one-sentence definition of big data to get the conversation started.
Big data is a collection of data from traditional and digital sources inside and outside of a company that represents a source for ongoing discovery and analysis. Some people like to constrain big data to digital inputs such as web behaviour and social network interactions. However, we cannot exclude traditional data derived from product transaction information, financial records and interaction channels such as call centres and points of sale. All of that is big data too, even though it may be dwarfed by the volume of digital data, which is now growing at an exponential rate. In defining big data, it is also important to understand the mix of unstructured and multi-structured data that makes up this volume of information.
Unstructured data comes from information that is not organized or easily interpreted by traditional databases or data models, and it is typically text-heavy. Metadata, Twitter tweets, and social media posts are good examples of unstructured data.
Multi-structured data refers to a variety of data formats and types and can be derived from interactions between people and machines, such as web applications or social networks. A great example of this would be web log data, which includes a combination of text and visual images along with structured data like form or transactional information.
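To make the web log example a little more concrete, here is a minimal sketch of how a single log entry can mix structured fields with free text. The log line, its layout, the field names and the `parse_log_line` helper are all assumptions made up for illustration; real web logs vary by server and configuration.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Hypothetical log line in a common-log-style layout (an assumption for
# illustration; real servers may log different fields in a different order).
LOG_LINE = (
    '203.0.113.7 - - [12/Mar/2013:10:15:32 +0000] '
    '"GET /search?q=red+running+shoes&page=2 HTTP/1.1" 200 5120'
)

# Pattern for the illustrative layout above, not a general-purpose parser.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)


def parse_log_line(line):
    """Split one log line into structured fields plus free-text search terms."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # the line does not follow the assumed layout

    record = match.groupdict()
    # Structured part: values with a fixed type and meaning.
    record["status"] = int(record["status"])
    record["size"] = int(record["size"])

    url = urlsplit(record["target"])
    record["path"] = url.path
    # Free-text part: whatever the visitor typed into the search box.
    query = parse_qs(url.query)
    record["search_text"] = " ".join(query.get("q", []))
    return record


record = parse_log_line(LOG_LINE)
print(record["path"], record["status"], record["search_text"])
```

The status code and response size fit neatly into a database column, while the search text is the kind of free-form content that traditional data models handle less easily.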
Every enterprise needs to fully understand big data: what it is to them, what it does for them, and what it means to them. The importance of big data is immense, and organisations can realise its value through a broad range of capabilities. Examples include data analysis tools, data warehouse testing, data asset management and comparative data analysis, all of which help people work interactively with big data.
Information is arguably the most important fuel businesses run on. Intellectual property such as patents, institutional knowledge collected and stored by employees, sentiment gleaned from millions of social media posts, and consumer insights from the analysis of myriad online transactions are just a few examples of the information assets companies leverage today. Companies all over the world need to wake up to the reality that information governance is more important in the era of big data than it was before. New big data tools built on technology such as Hadoop can process and analyse high volumes of data at reasonable cost, creating business intelligence that companies can use for competitive advantage. The beauty of Hadoop is that business users can keep everything, which is crucial because organisations do not want to archive or delete meaningful data.
OK, this brings us on nicely to “what is meaningful data?” or “how does a company know what data is meaningful?” Business Intelligence (BI) programs can make sense of structured data, giving companies a good, or even exact, sense of what data is meaningful. The percentage of a company's information volume that consists of structured data is surprisingly small, so information hoarding in order to leverage big data tools may work in the structured data world, but it will not work in the broader information world that includes unstructured content, most of which is duplicate or unnecessary (think of all the junk and transitory email).
You might ask how we know what to delete. According to the experts, current methods of information classification are inconsistent and do not scale well. The most deleted content is email, and the usual approach is time-based deletion: companies delete email after a certain amount of time, which can ultimately lead to deleting valuable information. What is needed is a way to analyse information automatically, with some human review, to judge its business value.
While BI has gained mainstream traction in the structured data world, content analytics has not yet done so in the unstructured content world. What companies must understand is that big data and the intelligence it can deliver are good and worth embracing, and that effective information governance not only helps make business operations more efficient but, very importantly, mitigates risk. Most organisations are so busy just trying to manage structured information that they haven’t yet addressed unstructured content, much less given enough attention to the litigation risk associated with information. Now is the time.