Passing the AWS Certified Data Engineer — Associate Exam in 2025
The AWS Certified Data Engineer — Associate exam is a challenging AWS certification. It validates your knowledge of core data-related AWS services and your ability to ingest and transform data, orchestrate data pipelines while applying programming concepts, design data models, manage data life cycles, and ensure data quality.
As stated in the AWS Certified Data Engineer — Associate (DEA-C01) Exam Guide, the exam covers the following domains:
- Domain 1: Data Ingestion and Transformation (34% of scored content)
- Domain 2: Data Store Management (26% of scored content)
- Domain 3: Data Operations and Support (22% of scored content)
- Domain 4: Data Security and Governance (18% of scored content)
Study Approach
The AWS Certified Data Engineer — Associate site provides a good overview of what this certification is about. Specific details are laid out in the Exam Guide. I almost always warm up for a certification with a good video course, and for this one you can use Udemy’s AWS Certified Data Engineer Associate 2024 — Hands On! course by Stephane Maarek and Frank Kane. Buy the course with a Udemy coupon; you can try to get one at https://www.datacumulus.com/. Complete the video course to get a good feel for the content before you move on to the practice questions.
Practice Questions
Preparing for the exam should involve a fair amount of repetition to ensure you understand the material thoroughly. Here is a list of the question sets that I found and used to pass this certification.
- Tutorials Dojo Data Engineer Practice Exam (177 Questions, 2 Full Tests, with randomized full tests)
- Whizlabs Certified Data Engineer Associate Certification (130 Questions, 2 full tests)
- ExamTopics AWS Certified Data Engineer — Associate DEA-C01 Exam (36 Free Questions, 152 overall)
- Udemy Certified Data Engineer Associate 2024 — Hands On! (65 Questions, 1 full test)
- Udemy AWS Certified Data Engineer — 3 Extra Practice Exams! (195 Questions, 3 full tests)
- AWS SkillBuilder Questions (20 Questions)
Practice Test Training
It is really important to practice with some testing materials so you get familiar with what to expect on the exam. You will also quickly understand your areas of weakness. Using multiple question sources will make your preparation more well rounded.
There are 3 main sets of questions I used, and I discuss them below. Generally, I wanted to reach the point where I scored 100% on the practice exams just to be sure I had a good handle on the testing material. The testing material is not always the same as the actual exam, but maximizing your mastery of it should help with the real thing.
Tutorials Dojo offers good questions where you can practice and save your results to review afterwards. You can work through the different Timed Mode and Review Mode questions until you get comfortable to do the “Final Test” which is a randomized set of 65 questions.
At first the tests will take you a fair amount of time and your marks will be lower than you would like. If you keep practicing, you should see your scores rise to high marks and your testing time come down drastically.
I was getting essentially 100% and completing the 65-question sets in 11–13 minutes by the time I decided I was ready to attempt the real exam.
ExamTopics offers a good set of questions that are really indicative of the actual exam questions. There are a bunch of annoying CAPTCHA and password prompts when using the free version, which is understandable.
You have to discern between the “Most Voted” answer and the initial answer that ExamTopics presents. Generally, after reading the comments, I would go with the answer selected by the community vote distribution. The Discussion popup shows other viewers’ comments and the arguments for the different choices, and in my opinion the “Most Voted” response is usually the one to go with.
ExamTopics does not actually provide 65-question sets or mark them for you. There are currently not a lot of questions there, but they are very realistic.
They include some questions that don’t directly apply to the actual certification exam (extra information about the topic in general, which shows they were written by an expert in the field, but which does not necessarily apply to the exam itself).
Most of the questions are not randomized, which leads you to start remembering the right responses from memory.
Personal Tips on Questions
Be consistent in taking your tests to get more comfortable with the material. Figure out the best time of day for you to practice and set up a schedule around it.
Also, don’t be afraid to get poor marks at the start. View them as an opportunity to improve, and review the incorrect responses before each new attempt. You will quickly notice that your scores get better and your time to complete a test comes down as well.
Key Points
The following are some key points to remember/understand when studying for this exam:
Kinesis Data Firehose is near real time / Kinesis Data Streams is real time (see the sketch after this list)
Lake Formation is the best way to govern and secure S3 data
Amazon S3 Object Lambda is useful to use your own code for transformations
S3 and serverless services will not be acceptable for real-time or high speed responses
Glue/Kinesis Data Firehose are not suitable for real-time applications (rather think EMR or Kinesis Data Streams)
KMS is almost always preferred for encryption
Step Functions for workflows that are not just Glue (ETL) specific
AWS Data Exchange — for sharing data sets
AppFlow for 3rd party SaaS applications
Questions about logs almost always involve Amazon CloudWatch
Amazon Macie is useful for identifying PII data but not for redaction (for redaction, consider S3 Object Lambda or Glue DataBrew)
CSV files are never query-efficient, always use a format such as Apache Parquet
Kafka Access Control Lists (ACLs) will help secure microservices in Amazon MSK
EventBridge Pipes — are intended for point-to-point integrations between supported sources and targets, with support for advanced transformations and enrichment
UltraWarm nodes for OpenSearch Service — provide a cost-effective way to store large amounts of read-only data on Amazon OpenSearch Service.
CloudWatch Container Insights — collect, aggregate, and summarize metrics and logs from your containerized applications and microservices
CloudWatch Application Insights — facilitate observability for your applications and underlying AWS resources
CloudWatch Contributor Insights - analyze log data and create time series that display contributor data.
CloudWatch Contributor Insights for DynamoDB — integrates with Amazon CloudWatch Contributor Insights to provide information about the most accessed and throttled items in a table or global secondary index
CloudWatch Log Insights — interactively search and analyze your log data in Amazon CloudWatch Logs
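To make the Firehose vs. Kinesis Data Streams distinction concrete, here is a minimal boto3 sketch; the stream and delivery stream names are hypothetical placeholders, not from the exam material.

```python
import json
import boto3

# Real time: producers write individual records to a Kinesis Data Stream,
# and consumers can read them within milliseconds.
kinesis = boto3.client("kinesis")
kinesis.put_record(
    StreamName="clickstream-events",  # hypothetical stream name
    Data=json.dumps({"user": "u1", "action": "click"}).encode("utf-8"),
    PartitionKey="u1",
)

# Near real time: Firehose buffers records by size or time interval before
# delivering them to destinations such as S3, Redshift, or OpenSearch.
firehose = boto3.client("firehose")
firehose.put_record(
    DeliveryStreamName="clickstream-to-s3",  # hypothetical delivery stream name
    Record={"Data": (json.dumps({"user": "u1", "action": "click"}) + "\n").encode("utf-8")},
)
```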
Athena Specific:
Use Amazon Athena Federated Query to join data across a wide variety of AWS services. Favored for one-time (ad hoc) queries.
Apache Iceberg tables (Athena) — a distributed, community-driven, Apache 2.0-licensed, 100% open-source data table format that helps simplify data processing on large datasets stored in data lakes
Athena Notebooks — You manage your notebooks in the Athena notebook explorer and edit and run them in sessions using the Athena notebook editor
Athena Partition Projection — you can use partition projection in Athena to speed up query processing of highly partitioned tables and automate partition management
Athena with a JSON SerDe library — to deserialize JSON data (see the sketch after this list)
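To tie the partition projection and JSON SerDe points together, here is a hedged sketch that registers a JSON table with projected date partitions through the Athena API; the bucket, database, table, and column names are hypothetical.

```python
import boto3

athena = boto3.client("athena")

# Hypothetical external table over JSON logs in S3, using the OpenX JSON SerDe
# and partition projection so Athena resolves dt=YYYY-MM-DD partitions without
# MSCK REPAIR TABLE or manual ALTER TABLE ADD PARTITION calls.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS logs.app_events (
  user_id string,
  action  string
)
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-data-lake/app-events/'
TBLPROPERTIES (
  'projection.enabled' = 'true',
  'projection.dt.type' = 'date',
  'projection.dt.range' = '2024-01-01,NOW',
  'projection.dt.format' = 'yyyy-MM-dd',
  'storage.location.template' = 's3://my-data-lake/app-events/dt=${dt}/'
)
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "logs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical results bucket
)
```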
Glue specific:
AWSGlueServiceRole managed policy — grants access to related services including EC2, S3, and CloudWatch Logs
AWS Glue DataBrew is used for data preparation and quality management: recipes are for data transformation, data quality rules are for data quality.
AWS Glue Data Quality is for monitoring data quality over time, not for redaction of data
Glue Flex — a flexible execution class that lowers the cost of non-urgent Glue jobs that can tolerate start-up delays
AWS Glue Interactive Sessions for troubleshooting and development of ETLs
Glue Job Editor’s Data Preview — can be used to determine the optimal number of DPUs (Data Processing Units)
Job metrics in AWS Glue — can also be used to determine the optimal number of DPUs (see the sketch after this list)
Glue Job Run monitoring — is a feature in AWS Glue that simplifies job debugging and optimization for your AWS Glue jobs
Glue Job Profiler — collects and processes raw data from AWS Glue jobs into readable, near real-time metrics stored in Amazon CloudWatch
Glue FindMatches ML — to find matching records across a dataset (including ones without identifiers)
Glue’s Sensitive Data Detection Feature for redaction
Glue for Ray — batch and real-time processing
Glue Sensitive Data Detection Feature — uses pattern matching and machine learning to automatically detect Personal Identifiable Information (PII) and other sensitive data
AWS Glue Studio Job Notebooks — you can explore your data and start developing your job script after only a few seconds.
AWS Glue Workflows for automating Glue jobs (ETL); preferred over Step Functions if only Glue/S3 services are involved
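As a rough illustration of the job metrics / DPU tuning point, you can enable job metrics when starting a run and then inspect the run afterwards; the job name, worker settings, and argument values below are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# Start a run of a hypothetical Glue ETL job with CloudWatch job metrics enabled,
# so executor and DPU utilization can be reviewed afterwards to right-size capacity.
run = glue.start_job_run(
    JobName="orders-etl",                    # hypothetical job name
    Arguments={"--enable-metrics": "true"},  # emit job metrics to CloudWatch
    WorkerType="G.1X",
    NumberOfWorkers=10,
)

# Check the run; ExecutionTime together with the CloudWatch job metrics indicates
# whether the job is over- or under-provisioned.
status = glue.get_job_run(JobName="orders-etl", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"], status["JobRun"].get("ExecutionTime"))
```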
Redshift Specific:
Redshift built-in audit logging — logs information about connections and user activities in your database.
Redshift Concurrency Scaling for optimum cluster capacity
Redshift Data Sharing — can securely share access to live data across Amazon Redshift clusters, workgroups, AWS accounts, and AWS Regions without manually moving or copying the data
Materialized Views improve query performance
Redshift Streaming Ingestion — provides low-latency, high-speed ingestion of stream data from Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka into an Amazon Redshift provisioned or Amazon Redshift Serverless materialized view
Redshift Data API — does not require a persistent connection to the database (see the sketch after this list)
Redshift User Defined Function — can create a custom scalar user-defined function (UDF) using either a SQL SELECT clause or a Python program
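A minimal sketch of the Redshift Data API point above (no persistent JDBC/ODBC connection is held); the cluster, database, secret, and table names are hypothetical.

```python
import time
import boto3

rsd = boto3.client("redshift-data")

# Submit a query asynchronously; no database connection is kept open by the client.
stmt = rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",  # hypothetical cluster
    Database="dev",
    SecretArn="arn:aws:secretsmanager:us-east-1:123456789012:secret:redshift-creds",  # hypothetical
    Sql="SELECT order_id, total FROM sales.orders LIMIT 10",
)

# Poll until the statement finishes, then fetch the result set.
while rsd.describe_statement(Id=stmt["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

for row in rsd.get_statement_result(Id=stmt["Id"])["Records"]:
    print(row)
```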
S3:
S3 Access Grants — map identities in directories such as Active Directory, or AWS Identity and Access Management (IAM) Principals, to datasets in S3
S3 Access Points - simplify data access for any AWS service or customer application that stores data in S3
S3 Lifecycle Policies - a set of rules that define actions that Amazon S3 applies to a group of objects
S3 Intelligent Tiering — delivers automatic storage cost savings when data access patterns change, without performance impact or operational overhead
S3 Infrequent access — for occasional queries
S3 Versioning — means of keeping multiple variants of an object in the same bucket
S3 Transfer Acceleration — can speed up content transfers to and from Amazon S3 by as much as 50–500% for long-distance transfer of larger objects
S3 Select — can use structured query language (SQL) statements to filter the contents of an Amazon S3 object and retrieve only the subset of data that you need (see the sketch below)
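And a small sketch of S3 Select pulling just a filtered subset of a single CSV object; the bucket, key, and column names are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Filter a single CSV object server-side so only matching rows are returned.
resp = s3.select_object_content(
    Bucket="my-data-lake",     # hypothetical bucket
    Key="exports/orders.csv",  # hypothetical key
    ExpressionType="SQL",
    Expression="SELECT s.order_id, s.total FROM S3Object s WHERE s.status = 'SHIPPED'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)

# The response is an event stream; 'Records' events carry the filtered bytes.
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```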
SageMaker Specific:
SageMaker Canvas — build highly accurate ML models using a visual interface, no code required
SageMaker ML Lineage Tracking will help establish model governance
SQS:
maxReceiveCount — the number of times a message is delivered to the source queue before being moved to the dead-letter queue; set inside the queue's RedrivePolicy (see the sketch after this list)
delaySeconds — the length of time, in seconds, for which the delivery of all messages in the queue is delayed
visibilityTimeout — the visibility timeout for the queue, in seconds
ReceiveWaitTimeSeconds — the length of time, in seconds, for which a ReceiveMessage action waits for a message to arrive.
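The four attributes above can be seen together in a minimal boto3 sketch; the queue names and values are hypothetical.

```python
import json
import boto3

sqs = boto3.client("sqs")

# Dead-letter queue that receives messages after too many failed deliveries.
dlq = sqs.create_queue(QueueName="orders-dlq")
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq["QueueUrl"], AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Main queue wiring together the attributes discussed above.
sqs.create_queue(
    QueueName="orders",
    Attributes={
        "DelaySeconds": "30",                   # delay delivery of all new messages
        "VisibilityTimeout": "120",             # how long a received message stays hidden
        "ReceiveMessageWaitTimeSeconds": "20",  # long polling on ReceiveMessage
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": "5",             # deliveries before moving to the DLQ
        }),
    },
)
```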
Exam Tips
- A response that relies on default settings is likely not the correct option
- A response that says something happens "automatically" is likely not the correct option
- S3 for storage and serverless options are almost always the most cost-effective choices
- Responses that include non-AWS services such as Zeppelin or Deequ are likely not valid
- Polling is never an option
- A response with custom scripts/custom code is never the right choice
- Prefer Transfer Acceleration over multipart uploads
- Serverless services are not applicable to real-time applications; anytime you see real-time requirements, you cannot have serverless components (S3, Lambda, Firehose, QuickSight, etc.)
- Parquet is the best data format for querying (with partitions). Never Avro, ORC, JSON, CSV.
- Any mention of manually triggering or configuring is likely not the right choice
- S3/Redshift are usually preferred for data analytics over DynamoDB (unless a low-latency query response is required)
Certain questions/concepts that you might not have encountered during your studies:
- Redshift Query Editor V2 — primarily used to edit and run queries, visualize results, and share your work with your team.
- With Cross-Region Data sharing, you can share data across clusters in the same AWS account, or in different AWS accounts even when the clusters are in different Regions
- Know how to configure an IdP with Redshift: you run a SQL statement to register the identity provider, including descriptions of the Azure application metadata.
- Know that you enable concurrency scaling by setting a workload management (WLM) queue as a concurrency scaling queue
- Know how to prevent a Glue crawler from creating multiple tables by confirming that all data files use the same schema, format, and compression type
- Know that AWS Glue triggers can be used to start specified jobs and crawlers (see the sketch after this list).
- Know that Athena, not S3 Select, is the most cost-effective method to query S3; S3 Select is limited because it only allows you to query one object at a time
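For the Glue triggers point in the list above, here is a small sketch that creates a scheduled trigger; the trigger, crawler, and job names are hypothetical.

```python
import boto3

glue = boto3.client("glue")

# A scheduled trigger that starts a crawler and a job every night at 02:00 UTC.
glue.create_trigger(
    Name="nightly-refresh",                   # hypothetical trigger name
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[
        {"CrawlerName": "raw-zone-crawler"},  # hypothetical crawler
        {"JobName": "orders-etl"},            # hypothetical job
    ],
    StartOnCreation=True,
)
```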
Conclusion
You will learn a lot about the core data-related AWS services and about ingesting, transforming, and orchestrating data in AWS. Use this as an opportunity to learn more. Use a video course and practice often to prepare sufficiently before taking the test.
Best of luck on your preparation and hopefully this article helps you to pass the AWS Certified Data Engineer — Associate Certification!