Passing the AWS Certified Data Analytics — Specialty Certification in 2024

Collin Smith
9 min read · Sep 22, 2023


AWS Certified Data Analytics — Specialty Certification

The AWS Certified Data Analytics — Specialty exam is a challenging AWS certification covering data analytics topics within the AWS environment.

As stated in the AWS Certified Data Analytics — Specialty Exam Guide, the exam covers the following domains:

  • Domain 1: Collection (18%)
  • Domain 2: Storage and Data Management (22%)
  • Domain 3: Processing (24%)
  • Domain 4: Analysis and Visualization (18%)
  • Domain 5: Security (18%)

Study approach

You can learn more about this certification on the AWS website; specifically, you can review the Exam Guide for the details to be covered. The approach I generally use is to get a video course that gives an overview of the material. I used the Frank Kane/Stephane Maarek Udemy course to prime myself for the material. You can get coupons for this Udemy course on the https://www.datacumulus.com/ website, or wait for Udemy sales; generally, the price should be less than $30. Go through the course to get an overview of the material to be covered. Once you have completed this, move on to training on practice questions.

This is a good and engaging course to move you through the material to pass the exam.

Practice Questions

Preparing for the exam involves practicing on question sets to ensure that you reach the right level of comfort with the material before you write the actual exam. I used the following question sets (each reviewed below):

  • Tutorials Dojo
  • Whizlabs
  • The Udemy questions bundled with the Maarek/Kane video course

Other question sets that I did not use, but that you could investigate:

  • Maarek/Singh AWS Certified Data Analytics Specialty Questions (65 questions)
  • Exam Topics (free, 172 questions)

I browsed through the Exam Topics questions after I took the exam, and I think I might use them if they are available for the next certification I take. It looks like there is some overlap with the Tutorials Dojo questions. It is a bit annoying that you have to type in a Captcha code or press a button on each page, which can be avoided if you pay for a subscription. That said, there are some legitimate questions here that are not in the other training resources.

Practice test training

There is no substitute for training on the test materials. It will help you familiarize yourself with what you can expect on the exam, and it will help you understand your weak areas.

Sometimes I take a piece of paper and chart the days along the x-axis and the scores along the y-axis, something like the following:

AWS Data Analytics Certification Practice Score Progression

Although I have previously done this on paper, the chart above was made as an Excel line chart to show the timeline and progress.
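The same chart is easy to produce in code. A minimal matplotlib sketch, using made-up dates and scores purely for illustration:

```python
# Chart practice-test scores over time: days on the x-axis, scores on the y-axis.
# The dates and scores below are hypothetical placeholders.
import matplotlib.pyplot as plt

days = ["Sep 1", "Sep 5", "Sep 9", "Sep 13", "Sep 17", "Sep 21"]
scores = [52, 61, 68, 74, 85, 100]

plt.plot(days, scores, marker="o")
plt.xlabel("Practice test date")
plt.ylabel("Score (%)")
plt.title("Practice Score Progression")
plt.tight_layout()
plt.show()
```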

I am never sure whether the practice questions are truly comparable to the continuously updated actual exam. To deal with this, I prefer to practice on several different sets of test materials to improve my chances. I also try to ensure I actually get 100% on the practice materials in the final run-up to the exam.

Tutorials Dojo

Tutorials Dojo has questions with a pretty good feel to them. When you are starting out, the question sets can take considerable time, like 2 to 3 hours for a 65-question set. Gradually, your scores will get better and the time to complete them will drop. On the last day before my exam, I got 100% on a sample test (see image below) and it only took me 11 minutes. You might want to take some extra time to ensure you get the best scores you can before taking the exam.

Tutorials Dojo is good in that both the question ordering and the multiple-choice options are randomized.

At the time of writing, I noticed that some of the Review Mode question sets (Review Mode Sets 1 & 2) have 70 questions. My small criticism is to not have question sets larger than the 65-question actual exam; I would take 5 questions out of the 70-question sets and move them to the sets that have fewer than 65. It also might be nicer if the section-based question sets had 25–30 questions instead of 12–20.

Whizlabs

Whizlabs does not shuffle the order of the answer options, so after a while you kind of know the answers by feel. For example, you get to a question and know the answer is the third option without really looking at it. For SA Pro and SysOps Associate, they do have some randomized final tests, which would be good for this exam as well.

Sharing some of the scores, along with the time taken, for the Whizlabs tests. You definitely get quicker as you gain familiarity with the content:

The test questions sometimes feel a bit odd due to the ESL (English as a Second Language) wording.

But Whizlabs does create a fair amount of value with its volume of questions. I also think they do a full rewrite of the question sets for every new exam version, whereas Tutorials Dojo updates its question sets gradually as they go.

Udemy Questions

These are the 65 questions that came with the video course from Stephane Maarek/Frank Kane. They seem a little dated, with questions based on topics such as Glacier Select, Ganglia, DynamoDB WCU/RCU calculations, EMR S3 integration with Pig or HBase, Sqoop, Flume, classic resize for Redshift, and the dblink function.

However, it was a set of questions that I did at least once a week, and I made sure I got 100% on it before taking the actual exam.

Personal Tips on questions

These are the notes I made before taking the actual exam; they are the tips I used to get decent marks on the exam questions.

Key points

The following are some key points to remember/understand when studying for this exam:

Real-time or near real-time visualization requires OpenSearch/Kibana

Real-time ingestion requires Kinesis Data Streams over Kinesis Data Firehose (Firehose is only near real time because it buffers)

Real-time processes generally cannot have any serverless technologies involved (Glue/S3/Firehose/QuickSight/Athena)

EMR instance fleets (allow multiple instance types) vs. instance groups (allow auto-scaling)

EMR CloudWatch metrics: YARNMemoryAvailablePercentage (memory) vs. CapacityRemainingGB (HDFS disk space) vs. ContainerPendingRatio (ratio of pending containers to containers allocated)
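As an illustration, a minimal boto3 sketch of pulling one of these metrics (the cluster ID is a placeholder):

```python
# Fetch recent YARNMemoryAvailablePercentage datapoints for an EMR cluster.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")
now = datetime.now(timezone.utc)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ElasticMapReduce",
    MetricName="YARNMemoryAvailablePercentage",  # swap in CapacityRemainingGB for HDFS space
    Dimensions=[{"Name": "JobFlowId", "Value": "j-XXXXXXXXXXXXX"}],  # placeholder cluster ID
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```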

Kinesis Data Streams can be consumed by Kinesis Data Analytics, Amazon EMR, Amazon EC2, and AWS Lambda (remember the mnemonic DEEL)
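As one concrete example of the "L" in DEEL, a minimal sketch of subscribing a Lambda function to a stream via an event source mapping (the stream ARN and function name are placeholders):

```python
# Wire a Lambda function up as a Kinesis Data Streams consumer.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
    FunctionName="my-stream-processor",  # hypothetical Lambda function
    StartingPosition="LATEST",
    BatchSize=100,
)
```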

Kinesis Data Firehose can deliver to Amazon S3, Amazon Redshift, Amazon OpenSearch Service (formerly Amazon Elasticsearch Service), generic HTTP endpoints, Datadog, New Relic, MongoDB, and Splunk
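A minimal sketch of creating a Firehose delivery stream with S3 as the destination (the role and bucket ARNs are placeholders):

```python
# Create a DirectPut Firehose delivery stream that buffers into S3.
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="my-delivery-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-analytics-bucket",
        # Buffering is why Firehose is near real time rather than real time.
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 5},
    },
)
```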

Preferably load Redshift with a single COPY command, and if possible with a manifest file
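A minimal sketch of one manifest-driven COPY issued through the Redshift Data API (cluster, table, bucket, and role names are placeholders):

```python
# One COPY command, driven by a manifest file listing the S3 objects to load.
import boto3

redshift_data = boto3.client("redshift-data")

copy_sql = """
    COPY sales
    FROM 's3://my-analytics-bucket/manifests/sales.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    MANIFEST
    FORMAT AS CSV;
"""

redshift_data.execute_statement(
    ClusterIdentifier="my-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql=copy_sql,
)
```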

Athena (query S3 and other data sources), S3 Select (query a subset of a single object), Redshift Spectrum (access S3 data alongside Redshift data), S3 Glacier Select (perform filtering directly against a Glacier object using standard SQL statements)
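To make S3 Select concrete, a minimal sketch of querying a subset of a single CSV object (bucket, key, and columns are placeholders):

```python
# Run a SQL expression against one CSV object and stream back matching rows.
import boto3

s3 = boto3.client("s3")

resp = s3.select_object_content(
    Bucket="my-analytics-bucket",
    Key="data/orders.csv",
    ExpressionType="SQL",
    Expression="SELECT s.order_id, s.total FROM s3object s WHERE CAST(s.total AS FLOAT) > 100",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"CSV": {}},
)
for event in resp["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```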

Redshift distribution styles: AUTO, EVEN, KEY, or ALL

Kinesis Data Analytics windowed queries include: stagger, tumbling, and sliding windows, plus continuous queries

For Redshift distribution keys, always order from highest cardinality to lowest cardinality

But for Redshift compound sort keys, always order from lowest cardinality to highest cardinality
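A minimal DDL sketch reflecting both cardinality rules, executed through the Redshift Data API (all names are made up):

```python
# High-cardinality DISTKEY; compound sort key ordered lowest -> highest cardinality.
import boto3

ddl = """
    CREATE TABLE sales (
        sale_id     BIGINT,
        region      VARCHAR(16),   -- low cardinality
        store_id    INTEGER,       -- medium cardinality
        customer_id BIGINT         -- high cardinality
    )
    DISTSTYLE KEY
    DISTKEY (customer_id)
    COMPOUND SORTKEY (region, store_id, customer_id);
"""

boto3.client("redshift-data").execute_statement(
    ClusterIdentifier="my-cluster", Database="dev", DbUser="awsuser", Sql=ddl,
)
```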

Presto is an open-source, distributed SQL query engine designed from the ground up for fast analytic queries against data of any size

The following words generally imply an incorrect solution or distractor:

  • Scheduling jobs
  • Custom application
  • Scripts
  • Generally any non-AWS technologies, such as Pig, HBase, Sqoop, or Flume

Additional information to consider/review when studying

These are some items you might want to consider, as they were not really covered in the testing materials I looked at. I think the training providers could enhance their materials by including some of the following:

13 Potential Scenarios one could consider in advance of taking the exam:

1. How do you control access with Lake Formation?

This is done through LF-TBAC (tag-based access control).
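A minimal sketch of an LF-TBAC grant, giving a principal SELECT on any table carrying a given LF-Tag (the role ARN, tag key, and tag values are placeholders):

```python
# Grant SELECT on all tables tagged classification=public.
import boto3

lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analysts"},
    Resource={
        "LFTagPolicy": {
            "ResourceType": "TABLE",
            "Expression": [{"TagKey": "classification", "TagValues": ["public"]}],
        }
    },
    Permissions=["SELECT"],
)
```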

2. What are the steps to convert an unencrypted Redshift cluster so that it is encrypted at rest?

See Encrypt your previously unencrypted Amazon Redshift cluster with 1-click

  • Use one-click encryption only when migrating to a KMS-encrypted cluster (see the API sketch after this list)
  • To convert to a cluster that uses a hardware security module (HSM), you must create a new encrypted cluster and move your data to it
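A minimal sketch of the one-click (KMS) path through the ModifyCluster API, assuming placeholder cluster and key identifiers:

```python
# Convert an unencrypted cluster to KMS encryption in place; the HSM path
# instead requires a new encrypted cluster plus a data migration.
import boto3

redshift = boto3.client("redshift")

redshift.modify_cluster(
    ClusterIdentifier="my-cluster",
    Encrypted=True,
    KmsKeyId="arn:aws:kms:us-east-1:123456789012:key/1234abcd-12ab-34cd-56ef-1234567890ab",
)
```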

3. What are the steps to convert a Redshift cluster to be encrypted with HSM?

Open the console, choose Clusters, choose Properties, choose Edit encryption, then choose Use AWS Key Management Service (AWS KMS) or Use a hardware security module (HSM).

See How do I encrypt my Amazon Redshift cluster?

4. How do you deal with Redshift when you get a specific endpoint error, such as “Failed to establish a connection to <endpoint>.”?

To connect to the cluster from a client tool outside of the network that the cluster is in, add an ingress rule. Add the rule to the cluster security group for the CIDR/IP that you are connecting from. (See Connection is refused or fails)
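A minimal sketch of that ingress rule with boto3 (the security group ID and CIDR are placeholders):

```python
# Allow a client network to reach the cluster's security group on the
# default Redshift port (5439).
import boto3

ec2 = boto3.client("ec2")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # the cluster's VPC security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5439,
        "ToPort": 5439,
        "IpRanges": [{"CidrIp": "203.0.113.0/24", "Description": "client network"}],
    }],
)
```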

5. If you are trying to ingest data in near real time into Salesforce, what services would you use?

You would use Amazon Kinesis Data Firehose and then send the data to an application like Salesforce with Amazon AppFlow.

You should not use Firehose with Data Streams; this is a use case for AppFlow. See Exam Topics.

6. A company wants to access Athena from its on-premises network by using a JDBC connection, and requests cannot traverse the Internet. Which combination of steps should be taken? (Choose 2)

  • Establish an AWS Direct Connect connection between the on-premises network and the VPC
  • Configure the JDBC connection to use an interface VPC endpoint for Athena

See Amazon Athena now provides an interface VPC endpoint, Connect to Amazon Athena using an interface VPC endpoint and Exam Topics
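A minimal sketch of creating the interface endpoint (the VPC, subnet, and security group IDs, plus the Region in the service name, are placeholders):

```python
# Interface VPC endpoint for Athena so JDBC traffic never crosses the Internet.
import boto3

ec2 = boto3.client("ec2")

ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.athena",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,
)
```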

7. If you want to show word or phrase frequency, you should use a word cloud.

8. If you have an MSK solution and you are concerned about how to monitor data that is coming into one of the topics, which monitoring level will help collect that data?

DEFAULT — free

PER_BROKER — Dimensions relating to Cluster Name, Broker ID

PER_TOPIC_PER_BROKER — Dimensions relating to Cluster Name, Broker ID, Topic

PER_TOPIC — *** There is no PER_TOPIC monitoring level

PER_TOPIC_PER_PARTITION — EstimatedTimeLag, OffsetLag

See Amazon MSK metrics for monitoring with CloudWatch
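A minimal sketch of raising the monitoring level so topic-level dimensions are emitted (the cluster ARN is a placeholder):

```python
# Bump an MSK cluster to PER_TOPIC_PER_BROKER monitoring.
import boto3

kafka = boto3.client("kafka")
cluster_arn = "arn:aws:kafka:us-east-1:123456789012:cluster/my-cluster/1a2b3c4d-1234-5678-9abc-def012345678-1"

current = kafka.describe_cluster(ClusterArn=cluster_arn)["ClusterInfo"]["CurrentVersion"]
kafka.update_monitoring(
    ClusterArn=cluster_arn,
    CurrentVersion=current,
    EnhancedMonitoring="PER_TOPIC_PER_BROKER",  # adds the Topic dimension
)
```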

9. How can you manage permissions at scale for Glue?

You should use IAM. See Setting up IAM permissions for AWS Glue
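For example, a minimal sketch of attaching the AWS managed Glue service policy to a role (the role name is a placeholder):

```python
# Attach the managed AWSGlueServiceRole policy to an existing IAM role.
import boto3

iam = boto3.client("iam")

iam.attach_role_policy(
    RoleName="MyGlueServiceRole",  # hypothetical role assumed by Glue jobs
    PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
)
```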

10. Is S3 Object Lock with compliance mode good for auditing?

If you have data in S3 that needs to be maintained for auditing purposes, how should you protect it?

With S3 Object Lock in compliance mode, since a protected object version can't be overwritten or deleted by any user, including the root user in your AWS account.
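A minimal sketch of applying compliance-mode retention, both as a bucket default and per object (bucket, key, and retention periods are placeholders; the bucket must have been created with Object Lock enabled):

```python
# Compliance-mode retention: no user, including root, can delete or overwrite
# a protected object version until the retention period expires.
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

# Bucket-level default retention.
s3.put_object_lock_configuration(
    Bucket="my-audit-bucket",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": 7}},
    },
)

# Or set retention per object at upload time.
s3.put_object(
    Bucket="my-audit-bucket",
    Key="logs/2023-09-22.json",
    Body=b"{}",
    ObjectLockMode="COMPLIANCE",
    ObjectLockRetainUntilDate=datetime(2030, 1, 1, tzinfo=timezone.utc),
)
```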

11. S3 Glacier Flexible Retrieval provides three retrieval options (a restore sketch follows the list):

  • expedited retrievals in about 1–5 minutes
  • standard retrievals that complete in 3–5 hours
  • free bulk retrievals that return large data sets in about 5–12 hours
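A minimal sketch of initiating a restore at one of these tiers (bucket and key are placeholders; switch Tier to "Standard" or "Bulk" as needed):

```python
# Kick off an Expedited (roughly 1-5 minute) restore of an archived object.
import boto3

s3 = boto3.client("s3")

s3.restore_object(
    Bucket="my-archive-bucket",
    Key="archive/2020/records.parquet",
    RestoreRequest={
        "Days": 7,  # how long the restored copy remains available
        "GlacierJobParameters": {"Tier": "Expedited"},
    },
)
```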

12. An inventory system using the Kinesis Producer Library (KPL) and Kinesis Client Library (KCL) is receiving duplicated data. Which factors could be causing the duplicated data? (Choose two)

  • The producer has a network-related timeout
  • There was a change in the number of shards, record processors, or both

See Handling Duplicate Records or Exam Topics

13. A healthcare company wants to match patient records in S3 even when the records do not have a common unique identifier. Which solution meets this requirement?

Train and use the AWS Glue FindMatches ML transform in the ETL job.

See AWS Glue now provides FindMatches ML transform to deduplicate and find matching records in your dataset or Exam Topics
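A minimal sketch of creating the FindMatches transform through the Glue API; the database, table, role, and key column are placeholders, and the transform still needs to be taught with labeled examples before use:

```python
# Define a FindMatches ML transform over a cataloged patient table.
import boto3

glue = boto3.client("glue")

glue.create_ml_transform(
    Name="patient-record-matcher",
    Role="arn:aws:iam::123456789012:role/GlueFindMatchesRole",
    InputRecordTables=[{"DatabaseName": "healthcare", "TableName": "patients"}],
    Parameters={
        "TransformType": "FIND_MATCHES",
        "FindMatchesParameters": {
            "PrimaryKeyColumnName": "record_id",
            "PrecisionRecallTradeoff": 0.9,  # favor precision for patient matching
        },
    },
)
```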

Conclusion

You will learn a lot working through this certification. Start with a video course, then follow with good practice questions to deepen your knowledge and quicken how fast you can respond.


Collin Smith

AWS Ambassador/Solutions Architect/Ex-French Foreign Legion