Fortifying Your Data Lake
Given that data is the currency of the digital economy and businesses grow only as fast as the insights they derive from data, organizations are racing to deploy Big Data Analytics (BDA). According to IDC, BDA will be at the heart of digital transformation initiatives, with the worldwide market size reaching $274.3 billion by 2022. The phenomenal growth of BDA is attributed to two factors: the explosive growth of digital data and the availability of low-cost Cloud storage. It is further aided by the maturity of Cloud-native services for advanced analytics such as IoT and AI technologies.
The Data Lake is the foundation of a Data Analytics strategy: it is the central repository of organizational data from multiple sources, facilitating in-place analytics, security at scale and business continuity.
- It comprises data from machine sensors, server logs and clickstreams.
- Data is usually stored in its original format, as unstructured and semi-structured data, to make it easily available to anyone in the organization.
- It is leveraged differently by different organizations: as an archival facility, a landing zone, a sandbox for data scientists, or raw input for production workloads.
These characteristics make Data Lake security disproportionately critical: a single breach can compromise vast quantities of data, including sensitive financial and operational data of the company, customer records such as financial transactions, and Personally Identifiable Information, much of which is governed by regulation and compliance requirements and whose exposure can have devastating business and financial consequences.
Considering that the objective of a Data Lake is to provide access to everyone in the organization, security must be comprehensive yet meticulously designed to provide granular access.
How to Secure a Data Lake on the AWS Cloud
Securing a Data Lake must be integral to its design and architecture, taking into account data transport, usage, connected applications and governance requirements. AWS Lake Formation, with Amazon S3 at its core, enables you to set up a secure Data Lake with a robust set of features and services to protect data against internal and external threats, even in large, multi-tenant environments.
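As a starting point, the S3 location that holds the lake data is registered with Lake Formation so that it can manage access on your behalf. The minimal boto3 sketch below illustrates this; the bucket name and region are placeholders, not values from any specific deployment.

```python
import boto3

# Hypothetical bucket that will store the data lake; replace with your own.
LAKE_BUCKET_ARN = "arn:aws:s3:::example-datalake-raw"

lakeformation = boto3.client("lakeformation", region_name="us-east-1")

# Register the S3 location so Lake Formation can enforce its permission
# model on data stored under this path.
lakeformation.register_resource(
    ResourceArn=LAKE_BUCKET_ARN,
    UseServiceLinkedRole=True,  # let Lake Formation use its service-linked role
)
```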
Our vast experience of implementing Data Lakes with robust security is built around five pillars: access control; data governance and compliance; data protection; data lake usage; and data leak prevention.
Access Control: Controlling access to data and resources is vital to protect against misuse, whether due to malicious intent or negligence. AWS Lake Formation enables you to follow the principle of least privilege by defining policies for granular access. You can assign permissions to IAM and federated users based on roles and groups; specify permissions for each analytics service, such as Amazon Redshift and Amazon Athena; and control access to catalog objects such as tables and columns. For added security, MFA-based authentication can be enabled. Access to data in S3 buckets can be controlled via Access Control Lists, bucket policies and IAM permissions. You can also use AWS Glue Data Catalog resource policies to configure and control access to metadata.
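The sketch below illustrates the least-privilege idea with Lake Formation's GrantPermissions API: a hypothetical analyst role is granted SELECT on only two columns of a catalog table. The role ARN, database, table and column names are assumptions for illustration.

```python
import boto3

lakeformation = boto3.client("lakeformation", region_name="us-east-1")

# Hypothetical analyst role and catalog objects used for illustration.
ANALYST_ROLE_ARN = "arn:aws:iam::111122223333:role/marketing-analyst"

# Grant SELECT on two columns only, so queries made through this role
# via Athena or Redshift Spectrum never see the remaining columns.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": ANALYST_ROLE_ARN},
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "transactions",
            "ColumnNames": ["order_id", "order_total"],
        }
    },
    Permissions=["SELECT"],        # no ALTER, DROP or INSERT
    PermissionsWithGrantOption=[], # the role cannot re-grant access
)
```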
Data Governance & Compliance: Data governance and compliance is achieved with data classification and assigning different levels of access to individuals based on data categorization. Amazon Macie classifies data in S3 based on content type, file extension, etc which you can then tag with specific risk levels. For eg, content identified as credential or a key is assigned Risk Level 1. Depending on the classified data, Amazon Macie provides dashboards and alerts of how the data is being moved and accessed.
Data Protection: Securely ingest and transport data from the customer premises to the AWS Cloud using a secure VPN tunnel and authentication. Data in transit can be protected using Secure Sockets Layer (SSL/TLS) or client-side encryption. Customers can encrypt data before uploading it to S3 using a master key stored in AWS Key Management Service and use the same key to decrypt the data when downloading it. Similarly, data at rest can be encrypted using AWS server-side encryption or client-side encryption.
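For example, data at rest can be protected by asking S3 to encrypt each object with a customer-managed KMS key at upload time. The sketch below assumes a hypothetical key alias, bucket and file name.

```python
import boto3

s3 = boto3.client("s3")

# Upload an object with SSE-KMS; the bucket, key and KMS alias are placeholders.
with open("daily_extract.csv", "rb") as body:
    s3.put_object(
        Bucket="example-datalake-raw",
        Key="ingest/2024/daily_extract.csv",
        Body=body,
        ServerSideEncryption="aws:kms",           # server-side encryption with KMS
        SSEKMSKeyId="alias/datalake-ingest-key",  # customer-managed key
    )
```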
The data protection strategy must also account for attacks such as ransomware and include a backup strategy that combines native automated backups such as S3 versioning, cross-region replication and archiving with Amazon S3 Glacier.
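A minimal sketch of the backup side, assuming a placeholder bucket: versioning is switched on and older objects are transitioned to Glacier through a lifecycle rule. Cross-region replication would be configured separately.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "example-datalake-raw"  # placeholder bucket name

# Keep every version of every object so accidental or malicious
# overwrites and deletes can be rolled back.
s3.put_bucket_versioning(
    Bucket=BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)

# Archive objects older than 90 days to Glacier for low-cost retention.
s3.put_bucket_lifecycle_configuration(
    Bucket=BUCKET,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-to-glacier",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Status": "Enabled",
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```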
Data Lake Usage: As more data is ingested, a Data Lake can turn into a swamp without proper governance. Governance includes maintaining records of data sources and lineage, which have an important bearing on decision making as requirements change. Lineage also helps to maintain data integrity, as its records protect assets against unintentional or malicious deletion, corruption and tampering by rogue actors. This is supported by enabling S3 versioning, which keeps multiple copies of a data asset: when an asset is updated, prior versions are retained and remain available for use.
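With versioning enabled, prior versions of a tampered or deleted asset remain recoverable. The sketch below, with a placeholder bucket and object key, shows how an earlier version can be inspected and restored by copying it back over the current one.

```python
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "example-datalake-raw", "ingest/2024/daily_extract.csv"  # placeholders

# List all retained versions of the object (newest first).
versions = s3.list_object_versions(Bucket=BUCKET, Prefix=KEY)["Versions"]

# Restore the previous version by copying it back as the latest version.
previous = versions[1]  # versions[0] is the current (possibly corrupted) one
s3.copy_object(
    Bucket=BUCKET,
    Key=KEY,
    CopySource={"Bucket": BUCKET, "Key": KEY, "VersionId": previous["VersionId"]},
)
```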
Data Leak Prevention: Preventive measures such as monitoring and alerts can be implemented programmatically using Amazon CloudWatch and Amazon GuardDuty to monitor user access and trigger alarms via an AWS Lambda function in the event of unauthorized access. AWS CloudTrail tracks user activity and stores logs in S3 for forensic purposes. Lake Formation also lets you see detailed alerts on its dashboard and download audit logs for further analysis. Amazon CloudWatch publishes data on ingestion events and catalog notifications, which can be used to identify suspicious behaviour or demonstrate compliance with rules.
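As one possible wiring, an EventBridge rule can forward GuardDuty findings to a Lambda function that raises an alert. The handler sketch below publishes medium and high severity findings to a hypothetical SNS topic; the topic ARN and severity threshold are assumptions for illustration.

```python
import json
import os

import boto3

sns = boto3.client("sns")

# Hypothetical SNS topic for the security team, supplied via environment variable.
ALERT_TOPIC_ARN = os.environ.get(
    "ALERT_TOPIC_ARN", "arn:aws:sns:us-east-1:111122223333:datalake-security-alerts"
)


def handler(event, context):
    """Lambda entry point invoked by an EventBridge rule on GuardDuty findings."""
    finding = event.get("detail", {})
    severity = finding.get("severity", 0)

    # Only alert the team for medium/high severity findings (>= 4 on GuardDuty's scale).
    if severity >= 4:
        sns.publish(
            TopicArn=ALERT_TOPIC_ARN,
            Subject="GuardDuty finding on data lake",
            Message=json.dumps(finding, default=str),
        )
    return {"forwarded": severity >= 4}
```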
While a Data Lake is a treasure trove, it is also a hazard to the organization if its security architecture is not robust. Data security and integrity must be the bedrock on which the data strategy is built, and they are not hard to achieve when you work with experienced partners with deep business and technical understanding.
Noventiq Best Practices For Customers
- AWS VPN or OpenVPN is used between on-premises systems and the AWS Cloud
- All data is uploaded to Amazon S3 with server-side encryption
- To restrict access to S3 objects, separate IAM users and IAM groups are created for each department (see the sketch after this list)
- To restrict access to Redshift tables, individual users and Redshift groups are created for each department; column-level restrictions are applied on a need-to-know basis
- S3 buckets and Redshift clusters are always kept private
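A minimal sketch of the per-department IAM pattern above, assuming a hypothetical marketing department and bucket prefix layout: a group is created and given an inline policy that only allows reads under that department's prefix.

```python
import json

import boto3

iam = boto3.client("iam")

# Hypothetical department group; one such group is created per department.
iam.create_group(GroupName="datalake-marketing")

# Inline policy restricting the group to read-only access under its own prefix.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-datalake-raw/marketing/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::example-datalake-raw",
            "Condition": {"StringLike": {"s3:prefix": ["marketing/*"]}},
        },
    ],
}

iam.put_group_policy(
    GroupName="datalake-marketing",
    PolicyName="marketing-prefix-read-only",
    PolicyDocument=json.dumps(policy),
)
```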
If you are interested in strengthening the security of your Data Lake or want to know more about Data Lake security, please reach out to us.