Azure Purview Private Network Architecture and Lessons Learned

This is my second lab experiment with Azure Purview, this time using a private network deployment architecture. You can read the first blog post, Azure Purview to Discover PII in Azure File/Blob, AWS S3 and More, which covered a deployment with the public endpoint option.

Single region, multiple virtual networks

Most larger organizations follow a hub-and-spoke network design, and I wanted to stay as close as possible to a corporate Azure network (setting aside resiliency for the moment). With this option, the Purview Account, Governance Portal and Ingestion endpoints are exposed through Azure Private Endpoints. That means the Account and Portal are NOT accessible from the internet, which is very important for security (network isolation). The Integration Runtime is self-hosted and deployed in the Hub vNET, but it can also be deployed into Spoke vNETs. This matters for data access during scans because the connectivity is private and you have control over the source network. Three private endpoints, Storage blob | Storage queue | Event Hubs namespace, are created to ingest metadata into the managed resources. DNS is another important aspect. I am using Azure Private DNS for the private endpoints.
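Since DNS decides whether traffic really goes through the private endpoints, a quick check from a machine inside the vNET (for example the SHIR host) can confirm that the Purview hostnames resolve to private addresses. The following is a minimal sketch, not an official tool. The account name contoso-purview and the exact hostnames are assumptions for illustration; substitute the FQDNs listed on your own private endpoints.

```python
import ipaddress
import socket


def is_private_address(ip: str) -> bool:
    """Return True if the IP is in a private (RFC 1918 style) range."""
    return ipaddress.ip_address(ip).is_private


def resolve_ipv4(hostname: str) -> list:
    """Resolve a hostname to its IPv4 addresses using the system resolver."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def check_private_resolution(hostnames, resolver=resolve_ipv4) -> dict:
    """Map each hostname to True (all private), False (some public) or None (no DNS answer)."""
    results = {}
    for host in hostnames:
        try:
            addresses = resolver(host)
        except socket.gaierror:
            results[host] = None  # name did not resolve at all
            continue
        results[host] = bool(addresses) and all(
            is_private_address(a) for a in addresses
        )
    return results


# Example call (hypothetical names, run it from inside the vNET):
# check_private_resolution([
#     "contoso-purview.purview.azure.com",  # assumed account endpoint
#     "web.purview.azure.com",              # assumed portal endpoint
# ])
```

A False result for a name usually points to the Private DNS zone not being linked to the vNET the machine sits in, so the public record is returned instead.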

Azure Purview Deployment Architecture (Private Network)

Credential Management

First things first: you must configure a connection to the Azure Key Vault where the secrets are stored, and the Purview Managed System Identity (MSI) must be given permission to read secrets from that Key Vault.

You can use a Service Principal for Azure Blob. Azure File can't use a Service Principal or Managed Identity, so you have to rely on the account key. You will need a cross-account role in AWS to access S3.

Key Vault Connection in Governance Portal
Configure Credentials in Purview Governance Portal

Configure (register) Data Sources

For this test, I am targeting Azure Blob, Azure File and AWS S3. Maybe I will try AWS RDS and Microsoft SQL next time!

Azure Blob: You can register a storage account that is accessible from a private network or a public network. In my test, I am using a storage account that can be accessed from the vNET only.

Azure File: Even though Azure File can be accessed over a private endpoint, you can register Azure File only when the storage account endpoint is exposed publicly (not a good option for security!).

AWS S3: Just specify the S3 bucket ARN. Credential is needed when you configure the scan.

Data Source Registration in Purview

Configure Data Source Scans

Azure Blob Scan: The Self-Hosted Integration Runtime (SHIR) is used with Service Principal authentication. Be sure to test the connection. If you encounter an error, it is most likely due to missing permissions or because the storage account is not reachable from the SHIR. If you have turned on diagnostics on the storage account, the logs will tell you the reason!

Azure Blob Scan Configuration
Azure Blob Scan Result

Azure File: Azure File can be configured with the storage account key only. Also, the storage account must be accessible from the public network (internet), which is bad for security. Nonetheless, I configured the scan to test the power of Purview (only to find out it does not work!). I can configure the scan, but the scan always fails with a generic error. I am very disappointed with the error message because it gives me no way to investigate the issue. Interestingly, I was able to scan Azure File in my first experiment, where Purview was deployed with a public endpoint.

Another point I observed: you can't pick the Integration Runtime (SHIR) for this source, so the scan runs on the Azure Managed Integration Runtime, which is likely where the problems come from.

Azure File Scan Configuration
Azure File Scan Result

AWS S3: The AWS IAM role is configured in the credential and used in the scan configuration. I can access the S3 metadata (folders and files), but the scan always fails with a generic error! Again, I am disappointed that the error message does not provide enough information. Interestingly, I was able to scan S3 in my first experiment, where Purview was deployed with a public endpoint.

AWS S3 Scan Configuration
AWS S3 Scan Result

Discovery of Assets

Finally, let us look at the discovered assets and filter by Data Classification to see whether Purview found any NPI data and how much of it, which is the reason we are here! I used the built-in rules and Purview did find NPI, but in my experiment it could only discover it in Azure Blob.

NPI Asset Discovery in Purview

Scan Automation via Purview API

For simplicity, we are going to use Postman to get a token, get a scan definition, run a scan and verify the scan result.

Get Token: https://docs.microsoft.com/en-us/azure/purview/tutorial-using-rest-apis

https://login.microsoftonline.com/{directoryid}/oauth2/token

Get Token
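The same token request can be scripted instead of sent from Postman. Below is a minimal sketch using only the Python standard library and the client credentials flow described in the tutorial linked above. The resource value https://purview.azure.net is my reading of that tutorial, and the function and parameter names are my own, so treat both as assumptions.

```python
import json
import urllib.parse
import urllib.request

# Assumption: resource identifier used for Purview data plane tokens.
PURVIEW_RESOURCE = "https://purview.azure.net"


def build_token_request(directory_id: str, client_id: str, client_secret: str):
    """Build (but do not send) the POST request for a service principal token."""
    url = f"https://login.microsoftonline.com/{directory_id}/oauth2/token"
    body = urllib.parse.urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            "resource": PURVIEW_RESOURCE,
        }
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def get_token(directory_id: str, client_id: str, client_secret: str) -> str:
    """Send the request and return the access_token field of the JSON response."""
    request = build_token_request(directory_id, client_id, client_secret)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["access_token"]
```

In practice the client secret would come from Key Vault or an environment variable rather than being hard-coded in the script.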

Get a Scan : https://docs.microsoft.com/en-us/rest/api/purview/scanningdataplane/scans/get

Get A Scan

Run a Scan: https://docs.microsoft.com/en-us/rest/api/purview/scanningdataplane/scan-result/run-scan

Run a Scan
Scan Completed via Api
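For repeated runs, the get-scan and run-scan calls can be scripted as well. This is a rough sketch with the standard library only. The URL layout (https://{account}.purview.azure.com/scan/...) and the api-version value are assumptions based on my reading of the REST reference linked above, so verify them against the docs for your account. The helper names are hypothetical. With a private deployment, the calls have to originate from a network that can reach the account private endpoint.

```python
import urllib.parse
import urllib.request
import uuid

API_VERSION = "2018-12-01-preview"  # assumption: check the REST reference


def scan_url(account: str, data_source: str, scan_name: str) -> str:
    """URL of a scan definition under a registered data source."""
    ds = urllib.parse.quote(data_source, safe="")
    sc = urllib.parse.quote(scan_name, safe="")
    return (
        f"https://{account}.purview.azure.com/scan"
        f"/datasources/{ds}/scans/{sc}?api-version={API_VERSION}"
    )


def run_url(account: str, data_source: str, scan_name: str, run_id: str) -> str:
    """URL that starts a run with a caller-chosen run id."""
    base, query = scan_url(account, data_source, scan_name).split("?", 1)
    return f"{base}/runs/{urllib.parse.quote(run_id, safe='')}?{query}"


def build_request(method: str, url: str, token: str):
    """Build an authenticated request; PUT gets an empty body so Content-Length is sent."""
    data = b"" if method == "PUT" else None
    return urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Authorization": f"Bearer {token}"},
    )


def new_run_id() -> str:
    """Generate a fresh GUID to identify the scan run."""
    return str(uuid.uuid4())


# Example flow (hypothetical names; token obtained in the Get Token step):
# get_req = build_request("GET", scan_url("contoso-purview", "AzureBlob", "Scan-1"), token)
# put_req = build_request(
#     "PUT", run_url("contoso-purview", "AzureBlob", "Scan-1", new_run_id()), token
# )
# urllib.request.urlopen(put_req, timeout=30)
```

Keeping URL construction separate from sending makes it easy to print and compare the URLs against what Postman sends before any call is made.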

Conclusion

I think Azure Purview is still at an early stage in the NPI scanning market. It took a long time to scan just a few KB/MB of data, and it was only able to scan Azure Blob successfully. I have in-depth knowledge of Azure, but I couldn't figure out from the logs why the scan fails on Azure File. The same is true for AWS S3: I could discover the metadata, but the scan always fails. I opened a case on GitHub; hopefully Microsoft will respond with a resolution. If not, someone may respond to the post on Stack Overflow.
