December 12, 2014
Along with Big Data, the term “unstructured data” is also gaining popularity. You may be wondering what unstructured data actually is; this blog post tries to answer that in detail.
Unstructured data is defined as data that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it harder for traditional computer programs to interpret than data stored in fielded form in databases or annotated (semantically tagged) in documents.
Simply put, data is called unstructured when a traditional computer program cannot readily interpret it, because its format gives the program no pre-defined fields to rely on.
Text may not seem like a big problem at this point, mostly because mining data from text has been around for a long time. However, a large share of the data generated by humans is auditory or visual, and traditional computer programs cannot readily interpret it.
Software that generates machine-processable structure exploits the linguistic, auditory, and visual structure inherent in all forms of human communication. Algorithms can deduce this inherent structure from text, for instance, by probing word morphology, sentence syntax, and other small- and large-scale patterns. Unstructured information can then be enriched and tagged to resolve ambiguities, and relevancy-based techniques are used to facilitate search and discovery.
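To make the idea of deducing structure from text a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the sample sentence, the ISO-style date pattern, and the helper names (extract_dates, crude_stem, tag_text) are all made up for this post, and the suffix stripping is a deliberately crude stand-in for real morphological analysis.

```python
import re

# Hypothetical sample text; invented for illustration only.
TEXT = "Invoice 4821 was sent to Acme Corp on 2014-12-12 and paid on 2015-01-05."

# Small-scale pattern: dates written as YYYY-MM-DD (an assumption about the input).
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
WORD_RE = re.compile(r"[A-Za-z]+")


def extract_dates(text):
    """Return every YYYY-MM-DD style date found in the text."""
    return DATE_RE.findall(text)


def crude_stem(word):
    """Strip a few common English suffixes; a rough stand-in for morphology."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tag_text(text):
    """Attach simple machine-readable tags to a piece of free text."""
    stems = sorted({crude_stem(w) for w in WORD_RE.findall(text)})
    return {"dates": extract_dates(text), "stems": stems}


if __name__ == "__main__":
    print(tag_text(TEXT))
```

The resulting dictionary is a small piece of structure layered on top of the original sentence, which is the kind of tagging a search index could then use. Production systems would rely on proper language-processing libraries rather than hand-written patterns like these.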
Examples of “unstructured data” may include books, journals, documents, metadata, health records, audio, video, analog data, images, files, and unstructured text such as the body of an e-mail message, Web page, or word-processor document. While the main content being conveyed does not have a defined structure, it generally comes packaged in objects (e.g. in files or documents…) that themselves have structure and are thus a mix of structured and unstructured data, but collectively this is still referred to as “unstructured data”.
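An e-mail message is a handy way to see this mix of structured packaging and unstructured content. The sketch below uses Python's standard email module on a made-up message; the addresses, the message text, and the split_message helper are assumptions for illustration only.

```python
from email import message_from_string

# Hypothetical raw message; the addresses and wording are invented.
RAW = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly report
Date: Fri, 12 Dec 2014 09:30:00 +0000

Hi Bob, the numbers look good. Let's talk on Monday about the budget.
"""


def split_message(raw):
    """Separate the structured headers from the unstructured body."""
    msg = message_from_string(raw)
    # Headers are named fields, so a program can read them directly.
    headers = {name: msg[name] for name in ("From", "To", "Subject", "Date")}
    # The body is free text; nothing tells the program what it means.
    body = msg.get_payload()
    return headers, body


if __name__ == "__main__":
    headers, body = split_message(RAW)
    print(headers["Subject"])
    print(body)
```

The headers behave like database fields, while the body is just a string. Finding out that the message proposes a Monday meeting about the budget would take further text analysis.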
For example, an HTML web page is tagged, but HTML mark-up typically serves solely for rendering. It does not capture the meaning or function of tagged elements in ways that support automated processing of the information content of the page. XHTML tagging does allow machine processing of elements, although it typically does not capture or convey the semantic meaning of tagged terms.
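The gap between mark-up and meaning can be seen with a short Python sketch built on the standard html.parser module. The page fragment, the TagCollector class, and the collect helper are all hypothetical and exist only to illustrate the point.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record each piece of text together with the tag that encloses it."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._stack = []  # currently open tags, innermost last

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the closing tag matches the innermost open tag.
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            self.items.append((self._stack[-1], text))


def collect(html):
    """Return (enclosing tag, text) pairs for a fragment of HTML."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.items


# Hypothetical page fragment; the content is invented for illustration.
PAGE = "<p><b>Price:</b> <i>42</i></p><p><b>Jaguar</b> sighted</p>"

if __name__ == "__main__":
    for tag, text in collect(PAGE):
        print(tag, text)
```

The parser can report that "Jaguar" is bold and "42" is italic, but nothing in the tags says whether "Jaguar" is an animal or a car, or what currency "42" is in. That missing meaning is what makes the page content unstructured from a program's point of view.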