Amazon AWS Certified Database Specialty – Amazon DynamoDB and DAX Part 3
In this lecture, we’re going to talk about DynamoDB partitions. This is going to help you understand how DynamoDB works under the hood. So what exactly is a DynamoDB partition? The partitions store the DynamoDB table data physically. They are roughly 10 GB SSD volumes that physically store your table data. Now, you should not confuse this with your table’s partition key, or hash key, because the partition key is a logical partition, whereas the DynamoDB partition is a physical partition, an SSD volume sitting somewhere in an AWS data center. And one physical partition can store multiple partition keys.
A physical partition can hold multiple logical partitions based on the partition key values. And of course, a table can have any number of partitions; the number of partitions your table gets depends on its size and the capacity that you provision on it. How many partitions to use is decided internally by DynamoDB. You don’t have direct control over that, but your choice of provisioned capacity does affect the number of partitions that DynamoDB creates internally, and we’re going to look at how that happens in this lecture. So here we have three partitions, and this table has a provisioned capacity of 600 WCUs and 600 RCUs. The provisioned capacity is evenly distributed across the available table partitions, so since we have three partitions, each of them is going to receive a third of the capacity, about 200 WCUs and 200 RCUs. Another super important thing to remember is that partitions, once allocated, never get deallocated.
So if DynamoDB creates three partitions for your table, then even if you change the provisioned capacity later, even if you dial it down, the number of partitions is still going to remain at three. Partitions can increase, but they cannot decrease. All right? Now, how do you calculate the number of partitions? One physical partition can only support up to 1000 WCUs or 3000 RCUs. This is the maximum throughput any single DynamoDB partition can support. And again, the size limit is about 10 GB per partition.
So you can look at this diagram: a single SSD volume can store at most about 10 GB of data, and the maximum throughput it can provide is about 3000 RCUs or 1000 WCUs. So if you need more throughput, you will get more partitions, and if you need more than about 10 GB of storage, you will get more partitions. The number of partitions will be either the number of partitions based on throughput or the number of partitions based on table size, whichever is higher. So a simple formula for calculating the number of partitions would be something like this.
PT stands for partitions based on throughput: you simply divide the RCUs by 3000 and the WCUs by 1000, add the two, and round up to get the number of partitions based on throughput. Similarly, PS, partitions based on size, is your storage divided into chunks of 10 GB, rounded up. The maximum of the two gives you the number of partitions that DynamoDB might use internally for your table. So PT = ceil(RCUs / 3000 + WCUs / 1000), PS = ceil(storage in GB / 10), and P = max(PT, PS). Now, DynamoDB does not disclose the number of partitions, but you can estimate it using this simple formula.
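To make this concrete, here is a minimal Python sketch of that estimate. The thresholds of 3000 RCUs, 1000 WCUs, and 10 GB per partition come from the formula above; the actual partition count is internal to DynamoDB and may differ. The sample calls preview the worked example coming up next.

import math

def estimate_partitions(rcus, wcus, size_gb):
    # Partitions needed for throughput: each partition supports
    # up to 3000 RCUs or 1000 WCUs.
    p_throughput = math.ceil(rcus / 3000 + wcus / 1000)
    # Partitions needed for storage: each partition holds about 10 GB.
    p_size = math.ceil(size_gb / 10)
    # DynamoDB uses whichever is higher (and at least one partition).
    return max(p_throughput, p_size, 1)

print(estimate_partitions(500, 500, 5))    # -> 1 partition
print(estimate_partitions(1000, 1000, 5))  # -> 2 partitions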
So let’s take an example to understand how partitions affect table performance. We have a table with a provisioned capacity of 500 RCUs and 500 WCUs, and our storage requirement is under 10 GB. So let’s calculate the number of partitions. The number of partitions based on throughput would be 500 RCUs divided by 3000, plus 500 WCUs divided by 1000, which comes to about 0.67, and when you round it up, you get one partition. So our data can be stored in a single partition, and it can deliver the required throughput of 500 WCUs and 500 RCUs. But now, let’s say we scale up this table and set the new provisioned capacity at 1000 RCUs and 1000 WCUs. What would be the new number of partitions?
Let’s see. 1000 divided by 3000, plus 1000 divided by 1000, comes to around 1.33, which rounds up to two partitions. So DynamoDB is going to create two partitions and spread the data across them, and each partition will receive an equal share of the new provisioned capacity: half of the RCUs go to each partition and half of the WCUs go to each partition.
So each partition is going to receive 500 RCUs and 500 WCUs. So remember that when you scale up the capacity, you might end up increasing the number of partitions that get allocated to your table. Now, let’s see what happens when you scale up the storage. Say we have a provisioned capacity of 1000 RCUs and 500 WCUs, and our storage requirement is currently 5 GB. The number of partitions based on throughput would be 0.67, which rounds up to one partition. And let’s calculate based on size: you divide the storage size by 10 GB, so you get about 0.5. Again, based on storage,
you also need just one partition. The final number of partitions is the maximum of the two values, and the maximum of one and one is one. So with 5 GB of storage, 1000 RCUs, and 500 WCUs, you need about one partition. Now, let’s say we scale up the storage to about 50 GB. Again, let’s calculate the number of partitions. Partitions based on throughput remain the same, because we have not changed the provisioned capacity, but partitions based on size will now be five, because 50 GB divided by 10 gives us five. So the maximum of one and five is five.
So the total number of partitions will now be five, and your table will have five different partitions, each receiving a fifth of the provisioned capacity. And this is very, very important to note: each partition is now going to receive only about 200 RCUs and 100 WCUs. Do you notice that? Your table is now going to perform much lower per partition than it used to. To get the same level of performance, you’d have to increase the provisioned capacity about five times, and that also means increasing your costs about five times. And if we now increase the storage size to 500 GB, that’s going to result in a massive increase in the number of partitions: you’re going to need about 50 partitions, and each partition is going to receive just about 20 RCUs and 10 WCUs.
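Using the estimate_partitions sketch from earlier, the per-partition capacity for these two storage sizes works out like this:

p = estimate_partitions(1000, 500, 50)    # -> 5 partitions
print(1000 / p, 500 / p)                  # -> 200.0 RCUs, 100.0 WCUs per partition

p = estimate_partitions(1000, 500, 500)   # -> 50 partitions
print(1000 / p, 500 / p)                  # -> 20.0 RCUs, 10.0 WCUs per partition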
That’s about one fiftieth of the provisioned capacity per partition. So if you want to get back to a throughput of 1000 RCUs and 500 WCUs, you will have to increase the provisioned capacity by 50 times, which comes to somewhere around 50,000 RCUs and 25,000 WCUs. So partition behavior is very important to understand. And remember that once the number of allocated physical partitions increases, they will not be deallocated again. If you want them deallocated, you have to create a new DynamoDB table.
In this lecture, we’re going to look at DynamoDB scaling. So let’s understand what scaling options we have when we work with DynamoDB. You can manually scale your provisioned capacity as and when needed: whenever you want to scale DynamoDB, you simply change your provisioned capacity. You can scale up anytime you want, but you can scale down only up to four times in a day. In addition, one extra scale-down is available to you if you have not scaled down in the last 4 hours, so effectively you get about nine scale-downs per day.
And as we have seen earlier, scaling affects partition behavior. If you increase your provisioned capacity, you might end up increasing the number of partitions on your table. And you already know that the provisioned capacity is evenly distributed across all partitions, so if you end up increasing the number of partitions, your per-partition throughput will drop by that factor. If your number of partitions increases from one to two, then the provisioned throughput will be split between these two partitions, and effectively each partition will receive only half of the provisioned capacity. That’s super important to know. We have already talked about this, and I’m repeating it because it’s so important: an increase in partitions on a scale-up will not be followed by a decrease in partitions on a scale-down. That simply means that once partitions have been allocated, they do not get deallocated later. All right.
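For reference, manual scaling is just an UpdateTable call. Here is a minimal boto3 sketch; the table name and capacity numbers are hypothetical:

import boto3

dynamodb = boto3.client("dynamodb")

# Scale a (hypothetical) table up to 1000 RCUs / 1000 WCUs.
dynamodb.update_table(
    TableName="GameScores",
    ProvisionedThroughput={
        "ReadCapacityUnits": 1000,
        "WriteCapacityUnits": 1000,
    },
)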
And now let’s look at auto scaling in DynamoDB. DynamoDB uses the AWS Application Auto Scaling service. There are no additional costs for auto scaling; you only pay for the capacity that gets provisioned. And it’s very easy to use: you simply set the desired target utilization along with the minimum and maximum provisioned capacity. So, for example, you set a minimum capacity of 10 RCUs, a maximum of 500 RCUs, and let’s say a target utilization of, say, 80%. Then, whenever your consumption increases or decreases, DynamoDB will automatically scale the provisioned capacity according to your consumption pattern.
There are two important things to note here. A scale-up happens when utilization goes above the target utilization percentage and stays there for at least two consecutive minutes. So if you have set your target utilization at, say, 80%, and you’re consuming above 80% for at least two minutes, then DynamoDB is going to increase your provisioned capacity. Similarly, auto scaling will scale down your capacity when your utilization falls below the target utilization percentage minus 20%. For example, if you have set the target utilization at 80%, then when your utilization falls below 60%, which is 80 minus 20, and stays there for at least 15 minutes, DynamoDB will scale down your capacity. And remember that small bursts, like the ones you see here with the black arrow, often get tolerated, because as we have already seen, burst capacity and adaptive capacity kick in to accommodate them.
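As a rough illustration, this is what configuring auto scaling for a table’s read capacity could look like with boto3 and Application Auto Scaling. The table name, capacity limits, and policy name are hypothetical, and the console can set this up for you as well:

import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the table's read capacity as a scalable target (10-500 RCUs).
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/GameScores",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    MinCapacity=10,
    MaxCapacity=500,
)

# Target-tracking policy: keep consumed/provisioned utilization around 80%.
autoscaling.put_scaling_policy(
    ServiceNamespace="dynamodb",
    ResourceId="table/GameScores",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    PolicyName="GameScoresReadScaling",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 80.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "DynamoDBReadCapacityUtilization"
        },
    },
)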
So now let’s quickly look at scaling and partition behavior together. Say we want to migrate a 5 GB SQL database to DynamoDB, and the provisioned capacity needed during regular business hours is 1000 RCUs and 500 WCUs. BAU here means business as usual: during the regular operation of your business, you just need about 1000 RCUs and 500 WCUs. Based on this information, the number of partitions that we require is just one. But transferring that much data from a SQL database to DynamoDB is going to take a lot of time.
So you decide to speed up your migration, and therefore you scale up your write capacity temporarily to 5000 WCUs. Originally it was 500 WCUs and now you’re scaling it up to 5000, so you’re scaling the write capacity about ten times. The new number of partitions based on throughput is six, so DynamoDB is going to allocate about six partitions for your table. Each partition will now receive about one sixth of the provisioned capacity, which is about 167 RCUs and about 833 WCUs. So essentially, the WCUs per partition have increased from 500 to 833, while the RCUs per partition have decreased from 1000 to 167. But during the migration you are only interested in WCUs; you have effectively increased the number of WCUs available for your migration, and hence you are speeding it up.
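Plugging these numbers into the earlier estimate_partitions sketch shows the per-partition capacity during the migration, and previews what happens once you scale back down:

p = estimate_partitions(1000, 5000, 5)   # ceil(0.33 + 5.0) -> 6 partitions
print(1000 / p, 5000 / p)                # ~167 RCUs, ~833 WCUs per partition

# After the migration, write capacity goes back to 500 WCUs,
# but the six partitions stay allocated:
print(500 / p)                           # ~83 WCUs per partition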
Right? WCUs per partition have increased from 500 to 833. The important catch here is what happens post-migration. Post-migration, we scale the write capacity back from 5000 to 500 WCUs. When you do that, pay attention to the new write capacity: the write capacity per partition will be one sixth of the new provisioned capacity, and one sixth of 500 is just about 83 WCUs. So the WCUs per partition have dropped from 833 to 83, and that’s going to create a serious performance bottleneck for your DynamoDB table. So remember to always provision the capacity that you need during your regular business operations; you should not increase your provisioned capacity temporarily to get short-term benefits, because now, to get from 83 back to 500 per partition, you will have to increase your provisioned capacity six times, and that’s going to increase your costs six times as well. This is really important to know when you work with DynamoDB tables. All right, let’s continue to the next lecture, where we’ll look at some of the DynamoDB best practices.
So in this lecture, we’re going to look at some of the best practices when working with DynamoDB tables. The first one is efficient key design. You should choose your partition key such that it has many unique values. If it has very few values, then you should either choose a different partition key or consider the write sharding approach that we discussed previously. Then you should distribute your reads and writes uniformly across partitions to avoid hot partitions. Remember that when you have hot partitions, adaptive capacity kicks in to help you, but you cannot always rely on it; you must ensure that you distribute your reads and writes uniformly across all the partition key values. And you should store hot and cold data in separate tables as far as possible.
This helps you uniformly distribute your reads and writes within a particular table, and we have discussed this previously as well. As we saw in our hands-on, you should know your query patterns in advance when you design your table; otherwise you’re going to need scan and filter operations, and scans and filters are very, very expensive. You should also choose your sort keys depending on your application’s needs. For example, in our gameplay table, we used a sort key on the game ID, which allowed us to get all the games played by a user, and we could also have a sort key on the game timestamp, which would allow us to get the historical data of a particular player.
Then you should use indexes based on your application’s query patterns, which is what we just discussed. Use your indexes wisely, based on the query patterns. Before you actually create your tables, think through your query patterns and understand which access patterns your application is going to make against the DynamoDB table, and create your indexes accordingly, so you don’t have to use scans and filters. You should use the primary key or LSIs when strong consistency is needed,
because, remember, global secondary indexes do not support strong consistency. When you create your table and an access pattern requires strong consistency, you should accommodate it within your primary key or within an LSI. Use GSIs for finer control over throughput, or when your application needs to query on a different partition key. GSIs have their own provisioned throughput, so if your use case requires finer control over throughput, you should consider using a GSI. Also, use shorter attribute names. Remember that DynamoDB’s storage accounting includes the attribute names: DynamoDB stores data as key-value pairs in a JSON-like format, so when your data is stored on the physical partitions, each item needs space for the attribute names as well as the attribute values. So use attribute names that are as short as possible, but also make sure they are intuitive; otherwise it will be an issue for your developers. All right.
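To tie these key and index recommendations together, here is a hedged boto3 sketch of a hypothetical GamePlays table: an LSI on a timestamp for strongly consistent per-user history queries, and a KEYS_ONLY GSI on the game ID with its own provisioned throughput. All names and capacity numbers are made up for illustration:

import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="GamePlays",
    AttributeDefinitions=[
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "game_id", "AttributeType": "S"},
        {"AttributeName": "game_ts", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "user_id", "KeyType": "HASH"},
        {"AttributeName": "game_id", "KeyType": "RANGE"},
    ],
    LocalSecondaryIndexes=[
        {
            # LSI: query a user's games by timestamp, with strong consistency.
            "IndexName": "user_by_time",
            "KeySchema": [
                {"AttributeName": "user_id", "KeyType": "HASH"},
                {"AttributeName": "game_ts", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "KEYS_ONLY"},
        }
    ],
    GlobalSecondaryIndexes=[
        {
            # GSI: query by game, with its own (eventually consistent) throughput.
            "IndexName": "by_game",
            "KeySchema": [{"AttributeName": "game_id", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "KEYS_ONLY"},
            "ProvisionedThroughput": {
                "ReadCapacityUnits": 100,
                "WriteCapacityUnits": 50,
            },
        }
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 500, "WriteCapacityUnits": 500},
)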
Then, for storing larger item attributes, you should consider compression: you can zip your attribute values to reduce their size. You can also use S3 to store large attributes. For example, if you have media files like music or video files, instead of storing that data in your DynamoDB table, you can store it in S3 and store its path in your DynamoDB table, which helps reduce the size of your items. Another approach is to split larger attributes across multiple items, which also helps you store larger data in DynamoDB.
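As a small sketch of the compression option, assuming a hypothetical Messages table with msg_id as its partition key, you could zip a large attribute before writing it and unzip it on read:

import zlib
import boto3

table = boto3.resource("dynamodb").Table("Messages")   # hypothetical table
long_text = "some very large attribute value ... " * 1000

# Store the large attribute compressed, as a binary value.
table.put_item(
    Item={
        "msg_id": "m-001",
        "body_zip": zlib.compress(long_text.encode("utf-8")),
    }
)

# Decompress after reading the item back.
item = table.get_item(Key={"msg_id": "m-001"})["Item"]
body = zlib.decompress(item["body_zip"].value).decode("utf-8")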
Now, when reading from a DynamoDB table, always make it a point to avoid scan and filter operations, because these are very costly. A scan operation reads the entire table across all of its partitions; you don’t use a partition key when running a scan, and that results in a performance bottleneck as well as a very expensive operation in terms of cost. And when you use filters: filters are applied on top of your query operations, so you make a query, and once you get the result from that query, the filter trims it down. Essentially, you pay for the whole query but only get a subset of the output back, and that’s why filter and scan operations are expensive and you should avoid them. The best way to avoid scans and filters is to design your table with your future access patterns in mind. You should also use eventual consistency for reads unless you really need strong consistency; this is going to save you about half of what you spend on reads for your DynamoDB table. Now, when creating or working with local secondary indexes, use LSIs sparingly. Really check whether you need them, because remember, LSIs consume the same capacity as your table, and they are going to put additional load on the underlying table.
You should also project fewer attributes on LSIs, because projections consume space and result in duplicated data; unless you need those attributes, consider not projecting them. The keys are always projected, but non-key attributes should be projected only if your application needs them. Then, always watch for expanding item collections. By item collection, I mean all the items that share one partition key value, a logical partition. On a table with an LSI, an item collection cannot grow beyond 10 GB, so once it reaches that size you cannot store any more data in that logical partition. You should always watch for this, and that’s why choosing the right keys is super important when you work with DynamoDB. Then, when working with global secondary indexes, the same rules apply: project fewer attributes to save on your storage costs,
but do project the attributes that your application needs; otherwise you’re going to require an additional read against the base table to fetch the data that isn’t in the index. And remember that you can use GSIs to create an eventually consistent read replica of your DynamoDB table. For example, if you create a global secondary index with the same partition key and sort key as the table, you are essentially creating a read replica of it. This is one way to create a read replica of your table, but remember that it will be eventually consistent, because GSIs do not support strong consistency.
Now let’s look at storing larger items. We have already discussed this briefly, so I’ll just spend a minute or so going through it. DynamoDB supports item sizes up to 400 KB; this is the maximum limit on each item, and it includes the attribute names and the attribute values, because DynamoDB stores data as key-value pairs. For storing larger items, you have a couple of options that we have already discussed: compress large attribute values, or store them in S3 and store the corresponding path in DynamoDB. So here we store the path of the file in DynamoDB, and the file itself is stored in S3, which considerably reduces the size of the DynamoDB table. When you read the item from DynamoDB, your application can then go to the S3 bucket and get the data using the object location stored in DynamoDB. Here’s how it works: you have a client and your DynamoDB table.
When you write data, you write the large object to S3, and the metadata, or the path to the S3 object, goes into the DynamoDB table. When you read, the client first fetches the metadata from DynamoDB, and then, using that metadata or path, it retrieves the large message or image file from S3. Along similar lines, you can also index S3 object metadata in DynamoDB. It works something like this: you write your data or files to S3, and then you use a Lambda function to write the metadata to DynamoDB. You can then use the DynamoDB API to get the object metadata, which makes your metadata searchable: you can search files by date, look at the total storage used by a customer, list all the objects with certain attributes, or find all the objects uploaded within a particular date range. This helps you build a more searchable index for your S3 objects.
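Here is a minimal sketch of that Lambda-based metadata indexing, assuming an S3 ObjectCreated trigger and a hypothetical DynamoDB table named S3Index keyed on the object key:

import urllib.parse
import boto3

# Hypothetical table that indexes S3 object metadata.
table = boto3.resource("dynamodb").Table("S3Index")

def handler(event, context):
    # Triggered by S3 ObjectCreated notifications.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        table.put_item(
            Item={
                "object_key": key,                    # partition key (assumed)
                "bucket": bucket,
                "size_bytes": record["s3"]["object"]["size"],
                "uploaded_at": record["eventTime"],   # lets you query by date
            }
        )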
That’s one of the use cases you can implement with DynamoDB. Now, on to some table operations. Let’s say you want to do a table cleanup; what options do you have? One option is to read each item and delete it, which means scanning the entire table and deleting the items one by one. That’s a very slow approach, and it’s expensive because it consumes your capacity units. A faster approach is to simply drop the table and recreate it: you delete the table and create a new table with the same name. This doesn’t consume any RCUs or WCUs,
so it’s an efficient as well as inexpensive way to clean up your table. Now, what if you want to copy a DynamoDB table to a new DynamoDB table? One option is to use AWS Data Pipeline. It works something like this: the data pipeline launches an EMR cluster, which handles exporting the data from the DynamoDB table. The EMR cluster reads the data from your DynamoDB table and writes it to an S3 bucket.
Then you create another data pipeline for the import job, which again launches an EMR cluster; this one reads the data from your S3 bucket and writes it to the target DynamoDB table. So that’s one way to copy data from one DynamoDB table to another.
The second option is to create a backup of your source DynamoDB table and restore that backup into a new table. Of course, this will take some time, because backup and restore operations do take a while to complete. And the third option is to scan the source table and write the data to the new table using your own application logic; this third option consumes capacity units on the source as well as the target table. A small sketch of the backup-and-restore option follows below. All right, that’s about it, and let’s continue to the next lecture.
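For reference, a minimal boto3 sketch of that second option, backing up a hypothetical source table and restoring it as a new table. In real code you would wait for the backup to become AVAILABLE before restoring:

import boto3

dynamodb = boto3.client("dynamodb")

# On-demand backup of the source table (hypothetical names).
backup = dynamodb.create_backup(
    TableName="GameScores",
    BackupName="GameScores-copy-backup",
)

# Restore the backup into a new table (poll describe_backup until the
# backup status is AVAILABLE before calling this in practice).
dynamodb.restore_table_from_backup(
    TargetTableName="GameScores-copy",
    BackupArn=backup["BackupDetails"]["BackupArn"],
)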