Addressing big data variety using an automated approach for data characterization

Georgios Vranopoulos*, Nathan Clarke, Shirley Atkinson

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review


Abstract

The creation of new knowledge from manipulating and analysing existing knowledge is one of the primary objectives of any cognitive system. Most of the effort in Big Data research has been focussed upon Volume and Velocity, while Variety, “the ugly duckling” of Big Data, is often neglected and difficult to solve. A principal challenge with Variety is being able to understand and comprehend the data. This paper proposes and evaluates an automated approach for metadata identification and enrichment in describing Big Data. The paper focuses on the use of self-learning systems that will enable automatic compliance of data with regulatory requirements, along with the generation of valuable and readily usable metadata for data classification. Two experiments, on data confidentiality and data identification, were conducted to evaluate the feasibility of the approach. The experiments aimed to confirm that repetitive manual tasks can be automated, reducing the effort a Data Scientist spends on data identification and leaving more for the extraction and analysis of the data itself. The datasets used originated from Private/Business and Public/Governmental sources and exhibited diverse characteristics in the number and size of files. The experimental work confirmed that: (a) the use of algorithmic techniques contributed to a substantial decrease in false positives in the identification of confidential information; (b) a fraction of a data set, together with statistical analysis and supervised learning, is sufficient to identify the structure of the information within it. With this approach, the issues of understanding the nature of data can be mitigated, enabling a greater focus on meaningful interpretation of the heterogeneous data.

Original language: English
Number of pages: 0
Journal: Journal of Big Data
Volume: 9
Issue number: 1
Early online date: 10 Jan 2022
Publication status: Published - 10 Jan 2022
