Fixing Past Mistakes
Anyone reading this article already understands the importance of Data Classification. Increased regulation has been a by-product of greater understanding of the dangers of sub-standard data privacy policies. Despite the fervor around having a data classification process in place, most organizations I’ve spoken to are not worried about their ability to tag new data resources with the correct classification going forward.
Most businesses are worried about classifying data that has never been tagged before. This includes tagging their known sensitive data assets as well as discovering and tagging their unknown sensitive data assets. Unknown sensitive data could include databases, buckets, and snapshots that they don’t know exist, for example in unauthorized regions in the public cloud. Unknown sensitive data assets could also include sensitive data stored in a datastore that was presumed to be non-sensitive, for example a public bucket that contains personally identifiable information (PII).
A problem most folks are not yet discussing in depth is their continuous data classification process. How will organizations validate classifications, re-classify, correct data misplacements, and re-tag data assets over time? This becomes a maturity concern once initial data classification efforts are in place.
Tagging = Nagging
Relying on a manual tagging process has limitations, even if asset tagging is shifted left to the owners of the data asset. Even if every asset is tagged by sensitivity upon creation, there will always be a need to identify and remediate data misclassifications. These misclassifications are typically detected by a Data Loss Prevention (DLP) scanning tool. The burden on teams to manually verify and remediate DLP findings will be significant and will involve a great deal of chasing down issues.
DLP != Reality
A naive alternative to the above would be to automatically classify data assets per a DLP’s findings. While this solves some problems related to the manual effort of validation and tagging, DLP tools (like any scanner) are prone to false positives. From a business perspective, the vendor behind any scanning tool has more to lose by missing a real issue than by generating false positives. Hence, every scanner will keep producing false positives up to the threshold where they start costing the vendor business. Many organizations complain about these incorrect sensitivity classifications when testing DLP tools. To blindly trust them would be a mistake.
There are also certain types of data classifications based on logical access separation, such as “public”, “internal”, and “restricted” (whether network or IAM restricted), that DLPs struggle to natively identify. This is because these classifications depend on the infrastructure, network, datastore, and IAM access configurations surrounding the data asset. On the other hand, DLPs excel at identifying PII, credit card, and health data via data object level scans.
J1 and Done
OK, so how does JupiterOne fit into the Data Classification problem? We are an asset and relationship discovery tool that can provide context to DLP scan findings for more intelligent tagging.
JupiterOne is extremely good at analyzing infrastructure, network, datastore, and access configurations, all of which provide context for data asset classifications such as public, internal, network restricted, and IAM restricted. When paired with a DLP tool that can identify PII, PCI, or PHI, we can automate what would otherwise be manual validations of DLP findings.
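As a rough illustration of that pairing, the sketch below derives an access classification from configuration facts and compares it with the data types a DLP scan reported. The field names, the order of the checks, and the rule that sensitive findings on a publicly readable asset count as a conflict are all assumptions made for illustration. They are not JupiterOne's or any DLP vendor's actual schema or logic.

```python
from dataclasses import dataclass, field


@dataclass
class DataAsset:
    # Hypothetical fields; a real inventory would expose far more detail.
    name: str
    publicly_readable: bool
    network_restricted: bool
    iam_restricted: bool
    dlp_findings: set = field(default_factory=set)  # e.g. {"PII", "PHI"}


def access_classification(asset: DataAsset) -> str:
    """Derive an access tier from configuration, not from data content."""
    if asset.publicly_readable:
        return "public"
    if asset.network_restricted or asset.iam_restricted:
        return "restricted"
    return "internal"


def review_needed(asset: DataAsset) -> bool:
    """Flag assets whose DLP findings conflict with how exposed they are."""
    return bool(asset.dlp_findings) and access_classification(asset) == "public"


assets = [
    DataAsset("marketing-site", True, False, False),
    DataAsset("exports", True, False, False, {"PII"}),
    DataAsset("billing-db", False, True, True, {"PCI"}),
]
flagged = [a.name for a in assets if review_needed(a)]
print(flagged)
```

Only the asset that is both publicly readable and carries a DLP finding is flagged, so a finding alone never drives the decision.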
For example, in the following graph visualization, we have queried for AWS Lambda functions that are internet facing and have access to buckets that are not classified as ‘public’, where those buckets have sensitive data findings from a DLP scanner (like Amazon Macie). This enables data classification automation that is not driven purely by DLP findings, but also by verification of real logical network and IAM access configurations. In this case, these buckets could potentially be classified as ‘public’ depending on our definition of internal, and the presence of sensitive data raises a conflict that needs to be looked at.
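The shape of that query can be sketched as a traversal over a small, hand-built graph. This is not JupiterOne's query language or data model. The node ids, property names, and relationship labels below are hypothetical and only show the three hops: internet-facing function, non-public bucket, DLP finding.

```python
# Hypothetical graph: nodes keyed by id, edges as (source, relationship, target).
nodes = {
    "fn-api": {"type": "lambda", "internet_facing": True},
    "fn-batch": {"type": "lambda", "internet_facing": False},
    "bucket-a": {"type": "bucket", "classification": "internal"},
    "bucket-b": {"type": "bucket", "classification": "public"},
    "finding-1": {"type": "dlp_finding", "data_type": "PII"},
}
edges = [
    ("fn-api", "HAS_ACCESS", "bucket-a"),
    ("fn-api", "HAS_ACCESS", "bucket-b"),
    ("fn-batch", "HAS_ACCESS", "bucket-a"),
    ("bucket-a", "HAS_FINDING", "finding-1"),
]


def targets(source, relationship):
    """Return the ids reachable from `source` over one edge of the given kind."""
    return [t for s, r, t in edges if s == source and r == relationship]


def conflicting_paths():
    """Collect (function, bucket, finding) paths matching the described query."""
    paths = []
    for fn_id, fn in nodes.items():
        if fn["type"] != "lambda" or not fn["internet_facing"]:
            continue
        for bucket_id in targets(fn_id, "HAS_ACCESS"):
            if nodes[bucket_id]["classification"] == "public":
                continue  # already tagged public, so no conflict to raise
            for finding_id in targets(bucket_id, "HAS_FINDING"):
                paths.append((fn_id, bucket_id, finding_id))
    return paths


paths = conflicting_paths()
print(paths)
```

Each returned path is a candidate for review: the bucket's tag says it is not public, yet an internet-facing function can reach it and a scanner reports sensitive content.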
Automated Tagging = Bragging rights
JupiterOne’s automation platform is driven via our open API, open SDKs, and Terraform provider. This allows users to leverage queries and alert rules for continuous monitoring and to trigger additional actions on the results, such as a webhook to a Lambda function for tagging. We will save this exercise for a follow-up blog post.
Monitoring for violations and conflicts in the alert rules is a natural derivative of the process. Automating data classification tags at the source and automating them within JupiterOne are both reasonable options. Tagging datastores directly at the source allows those tags to be automatically ingested into JupiterOne. Tagging datastores solely within JupiterOne will not tag the assets at the source. One benefit of having the tags exist only in JupiterOne is that it prevents internal asset owners from manipulating the tags on their assets.
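If tags are written at the source, one detail to watch is that some tagging APIs (S3 bucket tagging among them) replace the whole tag set rather than adding a single tag, so an automated tagger needs to merge before it writes. The sketch below shows only that merge step on a list of Key/Value pairs. The `Classification` tag key and the function name are hypothetical, and the call that writes the result back to the cloud provider is deliberately left out.

```python
def with_classification(tag_set, classification, key="Classification"):
    """Return a new tag list with the classification set, keeping other tags."""
    # Drop any previous classification tag, then append the new value.
    merged = [tag for tag in tag_set if tag["Key"] != key]
    merged.append({"Key": key, "Value": classification})
    return merged


existing = [
    {"Key": "Owner", "Value": "data-eng"},
    {"Key": "Classification", "Value": "internal"},
]
updated = with_classification(existing, "restricted")
print(updated)
```

Returning a new list instead of mutating the input keeps the previous tag set available for logging or rollback if a re-classification later turns out to be a false positive.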