Automating Data Classification with JupiterOne

July 28, 2021

Fixing Past Mistakes

Anyone reading this article already understands the importance of data classification. Increased regulation has been a by-product of a greater understanding of the dangers of substandard data privacy policies. Despite the fervor around having a data classification process in place, most organizations I’ve spoken to are not worried about their ability to tag new data resources with the correct classification going forward.

What most businesses are worried about is classifying data that has never been tagged before. This includes tagging their known sensitive data assets as well as discovering and tagging their unknown sensitive data assets. Unknown sensitive data could include databases, buckets, and snapshots that they don’t know exist, for example in unauthorized regions in the public cloud. It could also include sensitive data stored in a datastore that was presumed to be non-sensitive, for example a public bucket that contains personally identifiable information (PII).

A problem most folks are not yet discussing in depth is their continuous data classification process. How will organizations validate classifications, re-classify, correct data misplacements, and re-tag data assets over time? This becomes a maturity concern once initial data classification efforts are in place.

Tagging = Nagging

Relying on a manual tagging process has limitations, even if asset tagging is shifted left to the owners of the data asset. Even if all assets are tagged by sensitivity upon creation, there will always be a need to identify and remediate data misclassifications. These misclassifications are typically detected by a Data Loss Prevention (DLP) scanning tool. The burden on teams to manually verify and remediate DLP findings will be significant and will require a great deal of chasing.

DLP != Reality

A naive alternative to the above would be to automatically classify data assets per a DLP’s findings. While this solves some problems related to the manual effort of validation and tagging, DLP tools (like any scanner) are prone to false positives. From a business perspective, every scanning tool has more to lose by missing a real issue than by generating false positives. Hence, every scanner will keep producing false positives up to the threshold where they start costing its vendor business. Many organizations complain about these incorrect sensitivity classifications when testing DLP tools. To blindly trust them would be a mistake.

There are also certain types of data classifications based on logical access separation, such as “public”, “internal”, and “restricted” (whether network or IAM restricted), that DLPs struggle to natively identify. This is because these classifications depend on the infrastructure, network, datastore, and IAM access configurations surrounding the data asset. On the other hand, DLPs excel at identifying PII, credit card, and health data via data object level scans.

J1 and Done

OK, so how does JupiterOne fit into the Data Classification problem? We are an asset and relationship discovery tool that can provide context to DLP scan findings for more intelligent tagging.

JupiterOne is extremely good at analyzing infrastructure, network, datastore, and access configurations, all of which provide context for data asset classifications such as public, internal, network restricted, and IAM restricted. When paired with a DLP tool that can identify PII, PCI, or PHI, we can automate what would otherwise be manual validations of DLP findings.

For example, in the following query, we look for AWS Lambda functions that are internet facing and have access to buckets that are not classified as ‘public’, where those buckets have sensitive data findings from a DLP scanner (like Amazon Macie). This enables data classification automation that is not driven purely by DLP findings but is also verified against real logical network and IAM access configurations. In this case, these buckets could potentially be classified as ‘public’ depending on our definition of internal, and the fact that there is sensitive data raises a conflict that needs to be looked at.

FIND Internet
  THAT ALLOWS aws_security_group
  THAT PROTECTS Function
  THAT ASSIGNED AccessRole
  THAT ASSIGNED AccessPolicy
  THAT ALLOWS DataStore WITH classification != 'public'
  THAT HAS Finding WITH hasSensitiveData=true
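To make the decision logic behind that query concrete, here is a minimal Python sketch of how access context and a DLP finding could be combined into a single verdict. This is not JupiterOne code: the `DataStoreContext` fields, the `assess` function, and the verdict strings are all hypothetical names chosen for illustration, and the rule that "internet reachable" means "effectively public" is an assumption that depends on how your organization defines internal.

```python
from dataclasses import dataclass


@dataclass
class DataStoreContext:
    """Hypothetical summary of one datastore, as a graph query might return it."""
    name: str
    classification: str          # the tag currently on the asset, e.g. "internal"
    internet_reachable: bool     # derived from network and IAM relationships
    has_sensitive_finding: bool  # reported by a DLP scanner


def assess(store: DataStoreContext) -> str:
    """Return a verdict for one datastore (verdict names are assumptions)."""
    # Assumption: anything reachable from the internet is effectively public,
    # regardless of what its tag says.
    effectively_public = store.internet_reachable or store.classification == "public"

    if effectively_public and store.has_sensitive_finding:
        # Sensitive data in a publicly reachable store: needs a human to look at it.
        return "conflict"
    if store.internet_reachable and store.classification != "public":
        # The tag disagrees with the real access path, but no sensitive data was found.
        return "retag-public"
    return "ok"


if __name__ == "__main__":
    stores = [
        DataStoreContext("reports", "internal", True, True),
        DataStoreContext("assets", "internal", True, False),
        DataStoreContext("ledger", "restricted", False, True),
    ]
    for store in stores:
        print(store.name, assess(store))
```

The point of the sketch is that neither input decides the outcome alone: the DLP finding says what the data is, the access context says who can reach it, and a conflict is only raised when the two disagree.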

Automated Tagging = Bragging Rights

JupiterOne’s automation platform is driven by our open API, open SDKs, and Terraform provider. This allows users to leverage queries and alert rules for continuous monitoring and to trigger additional actions on the results, such as a webhook to a Lambda function for tagging. We will save this exercise for a follow-up blog post.

Monitoring for violations and conflicts in the alert rules is a natural derivative of the process. Automating data classification tags at the source and automating them within JupiterOne are both reasonable options. Tagging datastores directly at the source will allow those tags to be automatically ingested into JupiterOne. Tagging datastores solely within JupiterOne will not tag the assets at the source. One benefit of having the tags exist only in JupiterOne is that it prevents internal asset owners from manipulating the tags on their assets.
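As a rough illustration of the tag-at-the-source option, the sketch below shows one way a tagging function could set a classification tag on an S3 bucket without wiping out its other tags. It is a sketch under stated assumptions, not a JupiterOne integration: the `classification` tag key and both function names are hypothetical, and `s3_client` is assumed to be a boto3 S3 client created elsewhere (for example with `boto3.client("s3")`). The only AWS calls used are `get_bucket_tagging` and `put_bucket_tagging`. Because `put_bucket_tagging` replaces the whole tag set, the existing tags are read first and merged.

```python
CLASSIFICATION_KEY = "classification"  # assumed tag key, not a JupiterOne convention


def merge_classification_tag(tag_set, classification):
    """Return a new TagSet with the classification tag set and all other tags kept."""
    kept = [tag for tag in tag_set if tag["Key"] != CLASSIFICATION_KEY]
    return kept + [{"Key": CLASSIFICATION_KEY, "Value": classification}]


def tag_bucket(s3_client, bucket, classification):
    """Apply the classification tag to one bucket and return the resulting TagSet.

    s3_client is assumed to be a boto3 S3 client passed in by the caller.
    """
    try:
        current = s3_client.get_bucket_tagging(Bucket=bucket)["TagSet"]
    except s3_client.exceptions.ClientError as err:
        # S3 reports NoSuchTagSet when a bucket has no tags at all;
        # any other error is re-raised untouched.
        if err.response["Error"]["Code"] != "NoSuchTagSet":
            raise
        current = []

    new_tags = merge_classification_tag(current, classification)
    # put_bucket_tagging overwrites the bucket's entire tag set.
    s3_client.put_bucket_tagging(Bucket=bucket, Tagging={"TagSet": new_tags})
    return new_tags
```

Passing the client in as a parameter keeps the merge logic testable without AWS credentials, and makes it easy to call from whatever receives the alert, whether that is a Lambda function or a scheduled job.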

Akash Ganapathi

Akash Ganapathi comes from an enterprise security, data privacy, and data analysis background, working exclusively in the B2B software solutions space throughout his career. He is currently a Principal Security Solutions Architect at JupiterOne.
