Cloud-scale analytics recommends you consider the following requirements for governing data:
One challenge is that data is collected and stored in multiple places across the enterprise, possibly in different geographies and legal jurisdictions. As a result, different legislation might apply to governing the same data in different jurisdictions. Discover data distributed across multiple clouds and geographic locations, to:
Data classification is a way of categorizing data assets by assigning unique logical tags or classes to the data assets. Classification is based on the business context of the data.
You need a way to classify data by its level of confidentiality and by how long to keep it. The classification requires:
| Classification | Description |
|---|---|
| Public | Anyone can access the data and it can be sent to anyone. For example, open government data. |
| Internal use only | Only employees can access the data and it can’t be sent outside the company. |
| Confidential | The data can be shared only if it’s needed for a specific task. The data can’t be sent outside the company without a non-disclosure agreement. |
| Sensitive (personal data) | The data contains private information, which must be masked and shared only on a need-to-know basis for a limited time. The data can’t be sent to unauthorized personnel or outside the company. |
| Restricted | The data can be shared only with named individuals who are accountable for its protection. For example, legal documents or trade secrets. |

| Retention | Description |
|---|---|
| None | Data can be deleted at any time. |
| Temporary | Keep data for a short period of time. For example, keep Twitter data for a week. |
| Fixed period | Keep data for a set number of years, after which it can be deleted. For example, keep tax records for seven years to comply with government laws. |
| Permanent | Never delete data. For example, legal correspondence. |
Automate the data confidentiality and data lifecycle retention classification process, using the classes defined in each scheme, so that data is labeled consistently across the distributed data landscape. Consistent labels allow the data to be governed consistently and correctly. Then, define rules and policies for each class in the classification scheme to specify how to govern data according to its classification.
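As a rough illustration of automated labeling, the following Python sketch encodes the two schemes as enumerations and applies name-based rules to label columns. The rule patterns, the default class, and the function names `classify_columns` and `delete_after` are assumptions made for this example; a real deployment would typically rely on a data catalog's built-in classifiers rather than hand-written patterns.

```python
import re
from datetime import date, timedelta
from enum import Enum


class Confidentiality(Enum):
    PUBLIC = "Public"
    INTERNAL = "Internal use only"
    CONFIDENTIAL = "Confidential"
    SENSITIVE = "Sensitive (personal data)"
    RESTRICTED = "Restricted"


class Retention(Enum):
    NONE = "None"
    TEMPORARY = "Temporary"
    FIXED_PERIOD = "Fixed period"
    PERMANENT = "Permanent"


# Hypothetical detection rules: column-name patterns mapped to a class.
# Ordered from most to least restrictive so the first match wins.
COLUMN_RULES = [
    (re.compile(r"trade_secret|legal", re.I), Confidentiality.RESTRICTED),
    (re.compile(r"email|phone|ssn|birth", re.I), Confidentiality.SENSITIVE),
    (re.compile(r"salary|contract", re.I), Confidentiality.CONFIDENTIAL),
]


def classify_columns(columns, default=Confidentiality.INTERNAL):
    """Label each column with the first matching class, else the default."""
    labels = {}
    for name in columns:
        labels[name] = next(
            (cls for pattern, cls in COLUMN_RULES if pattern.search(name)),
            default,
        )
    return labels


def delete_after(retention, created, period_days=None):
    """Return the earliest date the data may be deleted, or None if never."""
    if retention is Retention.PERMANENT:
        return None
    if retention is Retention.NONE:
        return created
    if period_days is None:
        raise ValueError("Temporary and fixed-period retention need a period")
    return created + timedelta(days=period_days)


labels = classify_columns(["customer_email", "order_id", "base_salary"])
for column, label in labels.items():
    print(f"{column}: {label.value}")
```

Because every data store runs through the same rule list, the same column name always receives the same label, which is the consistency the automation is meant to provide.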
Another requirement is accountability. Without it, confusion lingers about who is accountable for governing data. If there’s no accountability, how do you answer the following questions?
Roles and responsibilities are needed to avoid confusion and to set the foundation upon which a data culture can materialize.
Processes are needed, along with roles and responsibilities, to:
Define policies and rules to govern:
Associate these policies and rules with each class in the data governance classification schemes.
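One way to associate policies with classes is a lookup table keyed by classification. The sketch below derives its sharing rules from the confidentiality table earlier in this article. The `Policy` fields, the string values, and the `can_send_externally` helper are hypothetical names for illustration, and the external-sharing rule for Restricted data is an assumption because the table doesn't state one.

```python
from dataclasses import dataclass
from enum import Enum


class Confidentiality(Enum):
    PUBLIC = "Public"
    INTERNAL = "Internal use only"
    CONFIDENTIAL = "Confidential"
    SENSITIVE = "Sensitive (personal data)"
    RESTRICTED = "Restricted"


@dataclass(frozen=True)
class Policy:
    internal_access: str       # "anyone", "employees", "need_to_know", "named"
    external_sharing: str      # "allowed", "nda_required", "forbidden"
    mask_before_sharing: bool  # True when private fields must be masked


# One policy per class, so every labeled asset resolves to exactly one rule set.
POLICIES = {
    Confidentiality.PUBLIC: Policy("anyone", "allowed", False),
    Confidentiality.INTERNAL: Policy("employees", "forbidden", False),
    Confidentiality.CONFIDENTIAL: Policy("need_to_know", "nda_required", False),
    Confidentiality.SENSITIVE: Policy("need_to_know", "forbidden", True),
    # Assumption: the table gives no external rule for Restricted data,
    # so this sketch takes the most conservative option.
    Confidentiality.RESTRICTED: Policy("named", "forbidden", False),
}


def can_send_externally(classification, has_nda=False):
    """Decide whether data of this class may leave the company."""
    rule = POLICIES[classification].external_sharing
    if rule == "allowed":
        return True
    if rule == "nda_required":
        return has_nda
    return False


print(can_send_externally(Confidentiality.CONFIDENTIAL, has_nda=True))
```

Keeping the rules in data rather than scattered through application code makes it easier to review a policy change in one place.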
Another requirement in governing data is master data management. Master data is the most widely shared data in any organization. It includes core data entities such as customer, supplier, materials, employee, and asset, as well as the financial chart of accounts data found in different financial applications. Because master data is so widely shared, it’s application agnostic: both operational transaction processing applications and analytical systems need it. Keeping this data synchronized can resolve many data errors and process errors. Ideally, you maintain it centrally through a common process and synchronize every system that needs it. You also need governance over who is allowed to maintain it and where that maintenance happens.
The same applies to reference data such as code sets and financial markets data. In this case, standardization and synchronization of code sets is known as reference data management, which is also a requirement.
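To make code set standardization concrete, here is a minimal sketch that maps each system's local country codes onto one central code set. The system names `crm` and `billing`, their local codes, and the `standardize` function are hypothetical; the standard values are ISO 3166-1 alpha-2 codes.

```python
# Central, governed code set (ISO 3166-1 alpha-2 country codes).
STANDARD_COUNTRY_CODES = {"GB", "US", "DE"}

# Hypothetical per-system mappings from local codes to the standard code set.
LOCAL_TO_STANDARD = {
    "crm": {"UK": "GB", "USA": "US", "GER": "DE"},
    "billing": {"826": "GB", "840": "US", "276": "DE"},
}


def standardize(system, local_code):
    """Translate a system's local code to the standard code, or fail loudly."""
    mapping = LOCAL_TO_STANDARD.get(system, {})
    # Codes that are already standard pass through unchanged.
    code = mapping.get(local_code, local_code)
    if code not in STANDARD_COUNTRY_CODES:
        raise ValueError(f"Unmapped code {local_code!r} from system {system!r}")
    return code


print(standardize("crm", "UK"))
print(standardize("billing", "840"))
```

Rejecting unmapped codes, rather than passing them through, surfaces gaps in the code set so that they can be fixed centrally.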
Finally, there’s a requirement for metadata lineage. You can use an audit trail to know where data originated and how it’s transformed en route to a report or a data store. You can use metadata to trace who or what is maintaining data, including when and where it occurs.
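A lineage audit trail can be pictured as a list of events, each recording a source, a target, the transformation applied, who or what performed it, and when. The sketch below is illustrative only: the `LineageEvent` fields, the dataset names, and the `trace` function are assumptions, and it assumes each dataset has a single upstream source.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageEvent:
    source: str           # where the data came from
    target: str           # where the data was written
    transformation: str   # what was done to it
    actor: str            # who or what performed the step
    occurred_at: datetime


def trace(events, dataset):
    """Return the events that produced `dataset`, ordered from origin onward."""
    by_target = {event.target: event for event in events}
    path = []
    seen = set()  # guards against cycles in malformed lineage data
    current = dataset
    while current in by_target and current not in seen:
        seen.add(current)
        event = by_target[current]
        path.append(event)
        current = event.source
    path.reverse()
    return path


# Hypothetical audit trail for a report built from a CRM extract.
events = [
    LineageEvent("crm.customers", "staging.customers", "copy",
                 "nightly-ingest-job", datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)),
    LineageEvent("staging.customers", "report.customer_summary", "aggregate",
                 "reporting-pipeline", datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)),
]

for step in trace(events, "report.customer_summary"):
    print(f"{step.source} -> {step.target} ({step.transformation}, by {step.actor})")
```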
You need an end-to-end solution that can govern data throughout its lifecycle across data stores in the edge, multiple clouds, and the datacenter.
Your data governance solution should have several components: