
Design considerations for Amazon EMR on EKS in a multi-tenant Amazon EKS environment

Many AWS customers use Amazon Elastic Kubernetes Service (Amazon EKS) to take advantage of Kubernetes without the burden of managing the Kubernetes control plane. With Kubernetes, you can centrally manage your workloads and offer administrators a multi-tenant environment where they can create, update, scale, and secure workloads using a single API. Kubernetes also allows you to improve resource utilization, reduce cost, and simplify infrastructure management to support different application deployments. This model is beneficial for those running Apache Spark workloads, for several reasons. For example, it allows you to have multiple Spark environments running concurrently with different configurations and dependencies that are segregated from each other through Kubernetes multi-tenancy features. In addition, the same cluster can be used for various workloads like machine learning (ML), host applications, and data streaming, thereby reducing the operational overhead of managing multiple clusters.

AWS offers Amazon EMR on EKS, a managed service that enables you to run your Apache Spark workloads on Amazon EKS. This service uses the Amazon EMR runtime for Apache Spark, which increases the performance of your Spark jobs so that they run faster and cost less. When you run Spark jobs on EMR on EKS instead of self-managed Apache Spark on Kubernetes, you can take advantage of automated provisioning, scaling, faster runtimes, and the development and debugging tools that Amazon EMR provides.

In this post, we show how to configure and run EMR on EKS in a multi-tenant EKS cluster that can be used by your various teams. We tackle multi-tenancy through four topics: network, resource management, cost management, and security.


Throughout this post, we use terminology that is specific to EMR on EKS, Spark, or Kubernetes:

  • Multi-tenancy – Multi-tenancy in Kubernetes can come in three forms: hard multi-tenancy, soft multi-tenancy, and sole multi-tenancy. Hard multi-tenancy means each business unit or group of applications gets a dedicated Kubernetes cluster; there is no sharing of the control plane. This model is out of scope for this post. Soft multi-tenancy is where pods might share the same underlying compute resource (node) and are logically separated using Kubernetes constructs through namespaces, resource quotas, or network policies. A second way to achieve multi-tenancy in Kubernetes is to assign pods to specific nodes that are pre-provisioned and allocated to a specific team. In this case, we talk about sole multi-tenancy. Unless your security posture requires you to use hard or sole multi-tenancy, you'll want to consider using soft multi-tenancy for the following reasons:
    • Soft multi-tenancy avoids underutilization of resources and waste of compute resources.
    • There is a limit on the number of managed node groups that can be used with Amazon EKS, so for large deployments, this limit can quickly become a limiting factor.
    • In sole multi-tenancy, there is a high chance of ghost nodes with no pods scheduled on them due to misconfiguration, because pods are forced onto dedicated nodes with labels, taints and tolerations, and anti-affinity rules.
  • Namespace – Namespaces are core in Kubernetes and a pillar for implementing soft multi-tenancy. With namespaces, you can divide the cluster into logical partitions. These partitions are then referenced in quotas, network policies, service accounts, and other constructs that help isolate environments in Kubernetes.
  • Virtual cluster – An EMR virtual cluster is mapped to a Kubernetes namespace that Amazon EMR is registered with. Amazon EMR uses virtual clusters to run jobs and host endpoints. Multiple virtual clusters can be backed by the same physical cluster. However, each virtual cluster maps to one namespace on an EKS cluster. Virtual clusters don't create any active resources that contribute to your bill or require lifecycle management outside the service.
  • Pod template – In EMR on EKS, you can provide a pod template to control pod placement, or define a sidecar container. This pod template can be defined for executor pods and driver pods, and stored in an Amazon Simple Storage Service (Amazon S3) bucket. The S3 locations are then submitted as part of the applicationConfiguration object that is part of configurationOverrides, as defined in the EMR on EKS job submission API (a minimal request sketch follows this list).
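
The following is a minimal sketch of a StartJobRun request payload that references pod templates stored in S3. The bucket, file names, entry point, virtual cluster ID, role ARN, and release label are placeholders:

{
  "name": "sample-spark-job",
  "virtualClusterId": "<VIRTUAL_CLUSTER_ID>",
  "executionRoleArn": "<EXECUTION_ROLE_ARN>",
  "releaseLabel": "emr-6.7.0-latest",
  "jobDriver": {
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://<BUCKET>/app.py"
    }
  },
  "configurationOverrides": {
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.kubernetes.driver.podTemplateFile": "s3://<BUCKET>/driver-pod-template.yaml",
          "spark.kubernetes.executor.podTemplateFile": "s3://<BUCKET>/executor-pod-template.yaml"
        }
      }
    ]
  }
}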

Security considerations

In this section, we address security from different angles. We first discuss how to protect the IAM role that is used for running the job. Then we address how to protect secrets used in jobs, and finally we discuss how you can protect data while it is processed by Spark.

IAM role security

A job submitted to EMR on EKS needs an AWS Identity and Access Management (IAM) execution role to interact with AWS resources, for example with Amazon S3 to get data, with Amazon CloudWatch Logs to publish logs, or to use an encryption key in AWS Key Management Service (AWS KMS). It's a best practice in AWS to apply least privilege to IAM roles. In Amazon EKS, this is achieved through IRSA (IAM Roles for Service Accounts). This mechanism allows a pod to assume an IAM role at the pod level and not at the node level, while using short-term credentials that are provided through the EKS OIDC provider.
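
For illustration, the following sketch shows the shape of an IRSA-enabled service account. The names and account ID are placeholders; in practice, EMR on EKS creates and annotates its job service accounts for you:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: emr-containers-sa-spark-driver-111122223333-example
  namespace: <EMR-VC-NAMESPACE>
  annotations:
    # Placeholder role ARN; IRSA exchanges the pod's projected OIDC token
    # for temporary credentials of this role.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/job-execution-role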

IRSA creates a trust relationship between the EKS OIDC provider and the IAM role. This method allows only pods with a service account (annotated with an IAM role ARN) to assume a role that has a trust policy with the EKS OIDC provider. However, this isn't enough, because it would allow any pod with a service account within the EKS cluster that is annotated with a role ARN to assume the execution role. This must be further scoped down using conditions on the role trust policy. The condition allows the role to be assumed only if the calling service account is the one used for running a job associated with the virtual cluster. The following code shows the structure of the condition to add to the trust policy:

    "Model": "2012-10-17",
    "Assertion": [
            "Effect": "Allow",
            "Principal": {
                "Federated": <OIDC provider ARN >
            "Action": "sts:AssumeRoleWithWebIdentity"
            "Condition": { "StringLike": { “<OIDC_PROVIDER>:sub": "system:serviceaccount:<NAMESPACE>:emr-containers-sa-*-*-<AWS_ACCOUNT_ID>-<BASE36_ENCODED_ROLE_NAME>”} }

To scope down the trust policy using the service account condition, you need to run the following command with the AWS CLI:

aws emr-containers update-role-trust-policy \
  --cluster-name cluster \
  --namespace namespace \
  --role-name iam_role_name_for_job_execution

The command adds the service account that will be used by the Spark client, Jupyter Enterprise Gateway, Spark kernel, driver, or executor. The service account names have the following structure: emr-containers-sa-*-*-<AWS_ACCOUNT_ID>-<BASE36_ENCODED_ROLE_NAME>.

In addition to the role segregation offered by IRSA, we recommend blocking access to instance metadata, because a pod can still inherit the rights of the instance profile assigned to the worker node. For more information about how you can block access to metadata, refer to Restrict access to the instance profile assigned to the worker node.
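
As an example, the following network policy is a minimal sketch that blocks pod egress to the instance metadata endpoint while allowing all other egress. It assumes your cluster runs a CNI that enforces egress network policies:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-instance-metadata
  namespace: <EMR-VC-NAMESPACE>
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 169.254.169.254/32 # EC2 instance metadata service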

Secret protection

Sometimes a Spark job needs to consume data stored in a database or from APIs. Most of the time, these are protected with a password or access key. The most common way to pass these secrets is through environment variables. However, in a multi-tenant environment, this means any user with access to the Kubernetes API can potentially access the secrets in the environment variables if this access isn't scoped well to the namespaces the user has access to.

To overcome this challenge, we recommend using a secrets store like AWS Secrets Manager that can be mounted through the Secrets Store CSI Driver. The benefit of using Secrets Manager is the ability to use IRSA and allow only the role assumed by the pod access to the given secret, thereby improving your security posture. You can refer to the best practices guide for sample code showing the use of Secrets Manager with EMR on EKS.
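
The following is a minimal sketch of a SecretProviderClass for the AWS provider of the Secrets Store CSI Driver. The secret name is a placeholder, and the driver and AWS provider are assumed to be installed in the cluster:

apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: spark-db-credentials
  namespace: <EMR-VC-NAMESPACE>
spec:
  provider: aws
  parameters:
    objects: |
      # Placeholder name of a secret in AWS Secrets Manager
      - objectName: "prod/spark/db-password"
        objectType: "secretsmanager"

The driver or executor pod template then mounts this class as a CSI volume; because the secret is resolved with IRSA, only pods running under the authorized execution role can read it.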

Spark data encryption

When a Spark application is running, the driver and executors produce intermediate data. This data is written to the node's local storage. Anyone who is able to exec into the pods would be able to read this data. Spark supports encryption of this data, and it can be enabled by passing --conf spark.io.encryption.enabled=true. Because this configuration adds a performance penalty, we recommend enabling data encryption only for workloads that store and access highly sensitive data and in untrusted environments.
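
For example, a job submission that enables this option could look like the following sketch, where the virtual cluster ID, role ARN, bucket, and job name are placeholders:

aws emr-containers start-job-run \
  --virtual-cluster-id <VIRTUAL_CLUSTER_ID> \
  --name sample-encrypted-job \
  --execution-role-arn <EXECUTION_ROLE_ARN> \
  --release-label emr-6.7.0-latest \
  --job-driver '{
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://<BUCKET>/app.py",
      "sparkSubmitParameters": "--conf spark.io.encryption.enabled=true"
    }
  }'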

Network considerations

In this section, we discuss how to manage networking within the cluster as well as outside the cluster. We first address how Spark handles communication between the driver and executors and how to secure it. Then we discuss how to restrict network traffic between pods in the EKS cluster and allow only traffic destined for EMR on EKS. Last, we discuss how to restrict traffic from executor and driver pods to external AWS services using security groups.

Network encryption

The communication between the driver and executors uses the RPC protocol and isn't encrypted. Starting with Spark 3 on Kubernetes-backed clusters, Spark offers a mechanism to encrypt communication using AES encryption.

The driver generates a key and shares it with the executors through an environment variable. Because the key is shared through the environment variable, potentially any user with access to the Kubernetes API (kubectl) can read the key. We recommend securing access so that only authorized users can have access to the EMR virtual cluster. In addition, you should set up Kubernetes role-based access control in such a way that reading the pod spec in the namespace where the EMR virtual cluster runs is granted to only a few selected service accounts. This method of passing secrets through the environment variable may change in the future, with a proposal to use Kubernetes secrets.

To enable encryption, RPC authentication must also be enabled in your Spark configuration. To enable encryption in transit in Spark, you should use the following parameters in your Spark config:

--conf spark.authenticate=true
--conf spark.network.crypto.enabled=true


Note that these are the minimal parameters to set; refer to Encryption for the complete list of parameters.

Additionally, applying encryption in Spark has a negative impact on processing speed. You should only apply it when there is a compliance or regulatory need.

Securing network traffic within the cluster

In Kubernetes, by default, pods can communicate over the network across different namespaces in the same cluster. This behavior isn't always desirable in a multi-tenant environment. In some instances, for example in regulated industries, to be compliant you want to enforce strict control over the network and send and receive traffic only from the namespace that you're interacting with. For EMR on EKS, that would be the namespace associated with the EMR virtual cluster. Kubernetes offers constructs that allow you to implement network policies and define fine-grained control over pod-to-pod communication. These policies are implemented by the CNI plugin; in Amazon EKS, the default plugin would be the VPC CNI. A policy is defined as follows and is applied with kubectl:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-np-ns1
  namespace: <EMR-VC-NAMESPACE>
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          nsname: <EMR-VC-NAMESPACE>
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          nsname: <EMR-VC-NAMESPACE>

Network traffic outside the cluster

In Amazon EKS, when you deploy pods on Amazon Elastic Compute Cloud (Amazon EC2) instances, all the pods use the security group associated with the node. This can be an issue if your pods (executor pods) are accessing a data source (namely a database) that allows traffic based on the source security group. Database servers often restrict network access to only where they are expecting it. In the case of a multi-tenant EKS cluster, this means pods from other teams that shouldn't have access to the database servers would be able to send traffic to them.

To overcome this challenge, you can use security groups for pods. This feature allows you to assign a specific security group to your pods, thereby controlling the network traffic to your database server or data source. You can also refer to the best practices guide for a reference implementation.
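
The following is a minimal sketch of the SecurityGroupPolicy custom resource involved. The security group ID is a placeholder, and the feature assumes the VPC CNI is configured for pod ENIs:

apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: executor-sg-policy
  namespace: <EMR-VC-NAMESPACE>
spec:
  podSelector: {}           # or select only executor pods via labels
  securityGroups:
    groupIds:
    - sg-0123456789abcdef0  # placeholder: group allowed by the database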

Cost management and chargeback

In a multi-tenant environment, cost management is a critical subject. You have multiple users from various business units, and you need to be able to precisely charge back the cost of the compute resources they have used. At the beginning of the post, we introduced three models of multi-tenancy in Amazon EKS: hard multi-tenancy, soft multi-tenancy, and sole multi-tenancy. Hard multi-tenancy is out of scope because the cost tracking is trivial; all the resources are dedicated to the team using the cluster, which isn't the case for sole multi-tenancy and soft multi-tenancy. In the next sections, we discuss how to track the cost for each of these two models.

Soft multi-tenancy

In a soft multi-tenant environment, you can perform chargeback to your data engineering teams based on the resources they consumed and not the nodes allocated. In this method, you use the namespaces associated with the EMR virtual cluster to track how many resources were used for processing jobs. The following diagram illustrates an example.

Diagram 1: Soft multi-tenancy

Tracking resources based on the namespace isn't an easy task, because jobs are transient in nature and vary in duration. However, there are partner tools available that allow you to keep track of the resources used, such as Kubecost, CloudZero, Vantage, and many others. For instructions on using Kubecost on Amazon EKS, refer to this blog post on cost monitoring for EKS customers.

Sole multi-tenancy

For sole multi-tenancy, the chargeback is done at the instance (node) level. Each of your teams uses a specific set of nodes that are dedicated to it. These nodes aren't always running, and are spun up using the Kubernetes auto scaling mechanism. The following diagram illustrates an example.

Diagram 2: Sole multi-tenancy

With sole multi-tenancy, you use a cost allocation tag, which is an AWS mechanism that allows you to track how much each resource has consumed. Although the sole multi-tenancy method isn't efficient in terms of resource utilization, it provides a simplified strategy for chargebacks. With the cost allocation tag, you can charge back a team based on all the resources they used, like Amazon S3, Amazon DynamoDB, and other AWS resources. The chargeback mechanism based on the cost allocation tag can be augmented using the recently launched AWS Billing Conductor, which allows you to issue bills internally for your teams.

Resource management

In this section, we discuss considerations regarding resource management in multi-tenant clusters. We briefly cover topics like sharing resources fairly, setting guardrails on resource consumption, techniques for ensuring resources for time-sensitive and/or critical jobs, meeting quick resource scaling requirements, and finally cost optimization practices with node selectors.

Sharing resources

In a multi-tenant environment, the goal is to share resources like compute and memory for better resource utilization. However, this requires careful capacity management and resource allocation to make sure each tenant gets their fair share. In Kubernetes, resource allocation is controlled and enforced by using ResourceQuota and LimitRange. ResourceQuota limits resources at the namespace level, and LimitRange allows you to make sure that all the containers are submitted with a resource requirement and a limit. In this section, we demonstrate how a data engineer or Kubernetes administrator can set up ResourceQuota as well as a LimitRange configuration.

The administrator creates one ResourceQuota per namespace that provides constraints for aggregate resource consumption:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
  namespace: teamA
spec:
  hard:
    requests.cpu: "1000"
    requests.memory: 4000Gi
    limits.cpu: "2000"
    limits.memory: 6000Gi

For LimitRange, the administrator can review the following sample configuration. We recommend using default and defaultRequest to enforce the limit and request fields on containers. Finally, from a data engineer perspective, when submitting EMR on EKS jobs, you need to make sure the Spark resource requirement parameters are within the range of the defined LimitRange. For example, in the following configuration, a request for spark.executor.cores=7 will fail because the max limit for CPU is 6 per container:

apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-min-max
  namespace: teamA
spec:
  limits:
  - max:
      cpu: "6"
    min:
      cpu: "100m"
    default:
      cpu: "500m"
    defaultRequest:
      cpu: "100m"
    type: Container

Priority-based resource allocation

Diagram 3: An example of resource allocation with priority

Because all the EMR virtual clusters share the same EKS compute platform with limited resources, there will be scenarios in which you need to prioritize jobs on a sensitive timeline. In this case, high-priority jobs can utilize the resources and finish their work, while low-priority jobs that are running get stopped and any new pods must wait in the queue. EMR on EKS can achieve this with the help of pod templates, where you specify a priority class for the given job.

When pod priority is enabled, the Kubernetes scheduler orders pending pods by their priority and places them in the scheduling queue. As a result, a higher-priority pod may be scheduled sooner than pods with lower priority if its scheduling requirements are met. If this pod can't be scheduled, the scheduler continues and tries to schedule other lower-priority pods.

The preemptionPolicy field on the PriorityClass defaults to PreemptLowerPriority, and the pods of that PriorityClass can preempt lower-priority pods. If preemptionPolicy is set to Never, pods of that PriorityClass are non-preempting. In other words, they can't preempt any other pods. When lower-priority pods are preempted, the victim pods get a grace period to finish their work and exit. If a pod doesn't exit within that grace period, the pod is stopped by the Kubernetes scheduler. Therefore, there is usually a time gap between the point when the scheduler preempts victim pods and the time that a higher-priority pod is scheduled. If you want to minimize this gap, you can set the deletion grace period of lower-priority pods to zero or a small number. You can do this by setting the terminationGracePeriodSeconds option in the victim pod YAML.
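
The following executor pod template is a minimal sketch of this setting; it assumes the low-priority class defined later in this section:

apiVersion: v1
kind: Pod
spec:
  priorityClassName: "low-priority"
  # Exit immediately on preemption to minimize the gap before the
  # higher-priority pod can be scheduled.
  terminationGracePeriodSeconds: 0
  containers:
  - name: spark-kubernetes-executors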

See the following code samples for priority classes:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 100
globalDefault: false
description: "High-priority Pods and Driver Pods."

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority
value: 50
globalDefault: false
description: "Low-priority Pods."

One of the key considerations when templatizing the driver pods, especially for low-priority jobs, is to avoid using the same low-priority class for both driver and executor. This saves the driver pods from getting evicted and losing the progress of all their executors in a resource congestion scenario. In this low-priority job example, we have used a high-priority class for the driver pod template and a low-priority class only for executor templates. This way, we can ensure the driver pods are safe during the eviction process of low-priority jobs. In this case, only executors will be evicted, and the driver can bring back the evicted executor pods as resources become freed. See the following code:

apiVersion: v1
kind: Pod
spec:
  priorityClassName: "high-priority"
  nodeSelector:
    eks.amazonaws.com/capacityType: ON_DEMAND
  containers:
  - name: spark-kubernetes-driver # This will be interpreted as the Spark driver container

apiVersion: v1
kind: Pod
spec:
  priorityClassName: "low-priority"
  nodeSelector:
    eks.amazonaws.com/capacityType: SPOT
  containers:
  - name: spark-kubernetes-executors # This will be interpreted as the Spark executor container

Overprovisioning with priority

Diagram 4: An example of overprovisioning with priority

As pods wait in a pending state due to resource availability, additional capacity can be added to the cluster with Amazon EKS auto scaling. The time it takes to scale the cluster by adding new nodes for deployment has to be considered for time-sensitive jobs. Overprovisioning is an option to mitigate the auto scaling delay using temporary pods with negative priority. These pods occupy space in the cluster. When pods with high priority are unschedulable, the temporary pods are preempted to make room. This causes the auto scaler to scale out new nodes due to overprovisioning. Be aware that this is a trade-off, because it adds higher cost while minimizing scheduling latency. For more information about overprovisioning best practices, refer to Overprovisioning.
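
A common implementation of this pattern, sketched below with placeholder sizing, pairs a negative-priority class with a deployment of pause pods that reserve headroom:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1                    # negative priority: always preempted first
globalDefault: false
description: "Placeholder pods that reserve capacity for pending jobs."

apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
  namespace: default
spec:
  replicas: 2                # placeholder: size to the headroom you need
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning
      containers:
      - name: reserve-resources
        image: registry.k8s.io/pause
        resources:
          requests:
            cpu: "1"
            memory: 4Gi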

Node selectors

EKS clusters can span multiple Availability Zones in a VPC. A Spark application whose driver and executor pods are distributed across multiple Availability Zones can incur inter-Availability Zone data transfer costs. To minimize or eliminate the data transfer cost, you should configure the job to run in a specific Availability Zone or even on a specific node type with the help of node labels. Amazon EKS places a set of default labels to identify capacity type (On-Demand or Spot Instance), Availability Zone, instance type, and more. In addition, we can use custom labels to meet workload-specific node affinity.

EMR on EKS allows you to choose specific nodes in two ways:

  • At the job level (see the sketch after this list). Refer to EKS Node Placement for more details.
  • At the driver and executor level using pod templates.
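
For example, at the job level, the driver and executor pods can be pinned to a single Availability Zone by passing Spark's node selector configuration in the job submission; the zone value here is a placeholder:

"sparkSubmitParameters": "--conf spark.kubernetes.node.selector.topology.kubernetes.io/zone=us-east-1a"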

When using pod templates, we recommend using On-Demand Instances for driver pods. You can also consider including Spot Instances for executor pods for workloads that are tolerant of occasional periods when the target capacity isn't completely available. Leveraging Spot Instances allows you to save cost for jobs that aren't critical and can be terminated. Please refer to Define a NodeSelector in PodTemplates.


Conclusion

In this post, we provided guidance on how to design and deploy EMR on EKS in a multi-tenant EKS environment through different lenses: network, security, cost management, and resource management. For any deployment, we recommend the following:

  • Use IRSA with a condition scoped on the EMR on EKS service account
  • Use a secrets manager to store credentials and the Secrets Store CSI Driver to access them in your Spark application
  • Use ResourceQuota and LimitRange to specify the resources that each of your data engineering teams can use and avoid compute resource abuse and starvation
  • Implement a network policy to segregate network traffic between pods

Finally, if you are considering migrating your Spark workload to EMR on EKS, you can further learn about design patterns to manage Apache Spark workloads in EMR on EKS in this blog post, and about migrating your EMR transient cluster to EMR on EKS in this blog post.

About the Authors

Lotfi Mouhib is a Senior Solutions Architect working for the Public Sector team with Amazon Web Services. He helps public sector customers across EMEA realize their ideas, build new services, and innovate for citizens. In his spare time, Lotfi enjoys cycling and running.

Ajeeb Peter is a Senior Solutions Architect with Amazon Web Services based in Charlotte, North Carolina, where he guides global financial services customers to build highly secure, scalable, reliable, and cost-efficient applications on the cloud. He brings over 20 years of technology experience in Software Development, Architecture, and Analytics from industries like finance and telecom.


