Amazon Comprehend - redaction of the PII data (introduction part)
Introduction
In the previous blog post, we learned how to detect PII data via the AWS SDK for Ruby. Before we deal with labeling PII data and redacting it in our own code, let's focus on redaction itself. Within the application we will do the redaction in-house, but first let's get familiar with redacting data with Amazon Comprehend.
The goals of this post are:
- Explain what data redaction is.
- Show what Amazon Comprehend offers for data redaction.
- Test data redaction with Amazon Comprehend via the AWS Dashboard.
General content
Redaction helps to protect PII data. It's a masking technique that removes or replaces all or part of a field value, either with special characters or with a label indicating that the text contained sensitive data.
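To make the two flavours of masking concrete, here is a minimal sketch in plain Ruby. It is not how Amazon Comprehend works internally; the Entity struct, the redact method, and the sample sentence are all made up for illustration. It only assumes that each detected entity comes with a type and character offsets.

```ruby
# Hypothetical entity shape (an assumption for this sketch): a type plus
# the character offsets of the sensitive span within the text.
Entity = Struct.new(:type, :begin_offset, :end_offset)

# Replace each entity span either with its type in brackets (mode: :type)
# or with a run of mask characters of the same length (mode: :mask).
def redact(text, entities, mode: :type, mask_char: "*")
  result = text.dup
  # Work from the end of the string so earlier offsets stay valid.
  entities.sort_by(&:begin_offset).reverse_each do |entity|
    length = entity.end_offset - entity.begin_offset
    replacement = mode == :type ? "[#{entity.type}]" : mask_char * length
    result[entity.begin_offset...entity.end_offset] = replacement
  end
  result
end

sample_text = "Contact John Doe at john@example.com."
sample_entities = [
  Entity.new("NAME", 8, 16),
  Entity.new("EMAIL", 20, 36)
]

puts redact(sample_text, sample_entities)
puts redact(sample_text, sample_entities, mode: :mask)
```

The first call replaces each span with its type, the second one masks it character by character, which keeps the length of the original text.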
Amazon Comprehend supports data redaction. When configuring a redaction job, we can:
- Select the types of PII entities to redact (22 types are available, grouped into 5 categories).
- Replace with PII entity type (every detected PII entity is replaced by its type).
- Replace with character (the asterisk (*) is the default; other characters are also possible).
Importantly, redaction is an asynchronous process. It runs as an analysis job that examines your documents, detects PII data, and finally redacts it. Both the input and the output must be files stored in an Amazon S3 bucket.
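Although this post uses the dashboard, the same options map onto the parameters of the asynchronous start_pii_entities_detection_job call in the AWS SDK for Ruby. The sketch below only builds the parameter hash, so it runs without AWS credentials. The redaction_job_params helper, the bucket names, the job name, and the role ARN are placeholders invented for this example.

```ruby
# Build request parameters for an asynchronous PII redaction job.
# The helper name and every default value are assumptions for this sketch.
def redaction_job_params(input_uri:, output_uri:, role_arn:,
                         entity_types: ["ALL"],
                         replace_with_type: true, mask_character: "*")
  redaction_config = { pii_entity_types: entity_types }
  if replace_with_type
    # Each detected entity is replaced by its type.
    redaction_config[:mask_mode] = "REPLACE_WITH_PII_ENTITY_TYPE"
  else
    # Each character of a detected entity is replaced by the mask character.
    redaction_config[:mask_mode] = "MASK"
    redaction_config[:mask_character] = mask_character
  end

  {
    job_name: "pii-redaction-example",           # placeholder name
    language_code: "en",
    mode: "ONLY_REDACTION",
    input_data_config: { s3_uri: input_uri, input_format: "ONE_DOC_PER_FILE" },
    output_data_config: { s3_uri: output_uri },
    data_access_role_arn: role_arn,
    redaction_config: redaction_config
  }
end

job_params = redaction_job_params(
  input_uri: "s3://example-input-bucket/docs/",   # placeholder bucket
  output_uri: "s3://example-output-bucket/out/",  # placeholder bucket
  role_arn: "arn:aws:iam::123456789012:role/example-comprehend-role",
  entity_types: %w[NAME EMAIL ADDRESS]
)
puts job_params

# With the aws-sdk-comprehend gem installed and credentials configured,
# the job could then be started along these lines (not executed here):
#   require "aws-sdk-comprehend"
#   client = Aws::Comprehend::Client.new(region: "us-east-1")
#   job_id = client.start_pii_entities_detection_job(job_params).job_id
```

Keeping the hash construction separate from the SDK call makes it easy to inspect or test the configuration before any job is submitted.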
PII redaction with Amazon Comprehend - AWS Dashboard
Let's move on to the AWS dashboard. Open Amazon Comprehend and then press Launch Amazon Comprehend. In the top left corner, below Amazon Comprehend, press Analysis jobs. You should be redirected to the Analysis jobs page.
Let's create a new job. We need to provide:
- Name of the job.
- Analysis type (either a built-in type, in our case personally identifiable information (PII), or a custom entity recognizer or document classifier for a customized analysis).
- The language of the input document (at the time of writing, only English is supported for PII).
Inside PII detection settings, select Redactions as the output mode. Then select the PII entity types to redact, and finally choose Replace with PII entity type as the Redaction Mode. This time we will replace PII data with its type.
For the input data, let's choose an example document (this one, to be more specific). It will be fetched automatically from an already existing S3 bucket. For your own documents, simply choose the S3 bucket that contains the document (or set of documents) that you want to analyze.
Important: Your bucket must be in the same AWS Region as the Amazon Comprehend API endpoint that you are calling.
The URI can point to a single file or a collection of data files. For example, with the URI S3://bucketname/prefix, if the prefix is a single file, Amazon Comprehend uses that file as input. If more than one file begins with the prefix, Amazon Comprehend uses all of them as input. So the same field covers both a single document and a whole batch.
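The prefix rule is easy to picture with a few object keys. The following plain Ruby sketch imitates the selection locally, without calling S3. The split_s3_uri and keys_matching_prefix helpers and the key names are invented for this example.

```ruby
# Split an S3 URI into bucket name and key prefix (hypothetical helper).
def split_s3_uri(uri)
  uri.sub(%r{\As3://}i, "").split("/", 2)
end

# Select the object keys that begin with the given prefix, which is how
# the input files are chosen according to the rule described above.
def keys_matching_prefix(keys, prefix)
  keys.select { |key| key.start_with?(prefix) }
end

# Made-up keys standing in for the contents of a bucket.
bucket_keys = [
  "docs/report.txt",
  "docs/report-2.txt",
  "docs/notes.txt",
  "other/report.txt"
]

_bucket, prefix = split_s3_uri("S3://bucketname/docs/report")
puts keys_matching_prefix(bucket_keys, prefix).inspect            # two report files
puts keys_matching_prefix(bucket_keys, "docs/notes.txt").inspect  # a single file
```

A prefix that names one complete key selects just that file, while a shorter prefix selects every key that starts with it.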
For the output data, choose the S3 bucket where you want Amazon Comprehend to save the results of your redaction job. You can also encrypt the output results of your analysis job by selecting the Encryption checkbox. More info on how to do it with AWS KMS can be found here.
For the access permissions, the IAM role you are using must have read permissions for the input S3 location and write permissions for the output S3 bucket. If you don't already have an IAM role with these permissions and an appropriate trust policy, choose Create an IAM role to create one.
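If you prefer to prepare the role yourself, the two policy documents could look roughly like the output of this sketch. The bucket names and helper methods are placeholders; the trust policy lets the Amazon Comprehend service assume the role, and the permissions policy covers reading the input and writing the output. Treat it as a starting point to adapt, not as the exact policy the console generates.

```ruby
require "json"

# Trust policy: allows the Amazon Comprehend service to assume the role.
def comprehend_trust_policy
  {
    "Version" => "2012-10-17",
    "Statement" => [
      {
        "Effect" => "Allow",
        "Principal" => { "Service" => "comprehend.amazonaws.com" },
        "Action" => "sts:AssumeRole"
      }
    ]
  }
end

# Permissions policy: read from the input bucket, write to the output one.
# Both bucket names are placeholders supplied by the caller.
def comprehend_s3_policy(input_bucket:, output_bucket:)
  {
    "Version" => "2012-10-17",
    "Statement" => [
      { "Effect" => "Allow", "Action" => "s3:GetObject",
        "Resource" => "arn:aws:s3:::#{input_bucket}/*" },
      { "Effect" => "Allow", "Action" => "s3:ListBucket",
        "Resource" => "arn:aws:s3:::#{input_bucket}" },
      { "Effect" => "Allow", "Action" => "s3:PutObject",
        "Resource" => "arn:aws:s3:::#{output_bucket}/*" }
    ]
  }
end

puts JSON.pretty_generate(comprehend_trust_policy)
puts JSON.pretty_generate(
  comprehend_s3_policy(input_bucket: "example-input-bucket",
                       output_bucket: "example-output-bucket")
)
```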
Click the Create job button. Wait a couple of minutes or more for the analysis job to complete. The length of time varies based on the size of your input documents.
Once it has completed, you can see the job details (Job details section), a link to the output file (Output section), as well as the selected PII types (under the PII entity types to redact section).
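Outside of the dashboard, waiting for the job usually means polling its status. Here is a small sketch of such a loop. The wait_for_job helper, the "TIMED_OUT" marker, and the fake status source are assumptions made for this example; with the SDK, the status would come from describe_pii_entities_detection_job instead.

```ruby
# Statuses after which the job will not change any more.
TERMINAL_STATUSES = %w[COMPLETED FAILED STOPPED].freeze

# Poll until the job reaches a terminal status or the attempts run out.
# `fetch_status` is any callable that returns the current status string.
def wait_for_job(fetch_status, interval: 30, max_attempts: 40)
  max_attempts.times do
    status = fetch_status.call
    return status if TERMINAL_STATUSES.include?(status)
    sleep(interval)
  end
  "TIMED_OUT" # local marker used by this sketch, not a Comprehend status
end

# Stand-in for the real call. With the SDK it could be something like:
#   -> { client.describe_pii_entities_detection_job(job_id: job_id)
#              .pii_entities_detection_job_properties.job_status }
fake_statuses = %w[SUBMITTED IN_PROGRESS IN_PROGRESS COMPLETED]
fake_fetch = -> { fake_statuses.shift }

puts wait_for_job(fake_fetch, interval: 0)
```

Passing the status source in as a callable keeps the loop testable without AWS access, and the attempt limit prevents it from waiting forever.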
Here is a little part of the redacted text:
Summary
- We've learned about redaction.
- We've learned how we can use Amazon Comprehend to redact data.
- We've played around with redacting an example document via the AWS Dashboard.