CompTIA Cloud+ CV0-003 – Domain 3.0 Maintenance Part 2

  1. Backups and Restores

Backup and Restore Considerations. When you’re looking at using the cloud for either backups or restores, there are some considerations that you really should be aware of. Let’s go ahead and talk about some of those. The first is the SLA. For those that need a refresher, the SLA is essentially an agreement between you and a provider that generally states what to measure and why, how to measure backup and restore (that’s the KPIs, for example), how to use and validate results, and who has responsibility when it comes to compliance. Another area of concern, or consideration, I should say, is whether the backup and restore service meets any compliance requirements you have. For example, if you are required to retrieve an email in under 4 hours for compliance reasons (I don’t know of any regulation that mandates that, but you need to look at it), can you recover it in time? Another area of concern is that you may have your primary data in the US.

But do you understand that perhaps your data may not be where you think it is? For example, some backup providers allow you to back up overseas. Now, again, that’s perfectly fine for data that might not have compliance requirements. But if you have to deal with Sarbanes-Oxley (SOX) requirements, for example, you need to be very cautious about the regulations around geography and any concerns around being able to recover that data, whatever that requirement is stated as. So once again, just be aware that compliance can be a big deal when it comes to your data. Some other considerations could be backup and restore schedules or configurations of the environment. One of the things I like to emphasise is that it’s one thing to back up data. It’s another thing to have to restore a virtual machine to its known state as well. So, in addition to backups and restores, you may need to consider dealing with specific images that are focused on boot configurations.

For example, object management, dependencies, and third-party services are all other areas you have to look at. There are two under each configuration, for example. Golden images can be a concern as well. Again, this is more focused on the private cloud, but one of the areas that you may run into is down the road. Let me just go back to the slide here. Down the road, what could happen? I’ve seen this happen at least once, where the customer needed to do a restore based on a lawsuit.

However, that data was actually backed up to a configuration that wasn’t supported anymore in the private cloud. So what does that mean? You must ensure that not only can you recover your data, but that you can also recover that data to an alternate configuration in some cases. Or you may have to actually just recover that whole configuration—that golden image could be something else you have to recover as well, not just the backup and restore data. Okay, so remember what an SLA is, right? This is probably one of the easiest areas to understand on the exam: what an SLA is. Here’s an exam tip. Make sure you understand that compliance and SLAs will be areas that will be tested around considerations for backups and restores.

  1. Disaster Recovery

Disaster recovery. Well, as the title implies, we’re going to talk about what to do and what to look for after a disaster strikes. Now on the exam, we would of course expect certain definitions to be known, and I want to make sure that you understand what “DR” is. DR refers to the processes, policies, and procedures needed to recover or ensure continued operations of technology infrastructure critical to an organisation after a disaster has occurred. Now don’t confuse DR with backups. This is actually different, and we’ll talk more about this in some of the other modules. But just realize that disaster recovery is more of a cleanup, whereas backing up is more of a restoring process. We’ll talk some more about those differences.

What are some benefits of having disaster recovery in the cloud available? Well, generally, if you have infrastructure and that infrastructure is damaged, you have a power outage issue. I had a customer who was literally up in Baltimore at the time of the Baltimore riots. Now, interestingly enough, they did everything that they should have done, and one of the things you just can’t plan for, at least well, is the possibility of a riot breaking out in front of your own office and data center. So that’s an interesting situation.

So with that said, they couldn’t actually get into that area because it was cordoned off by police; it was essentially under martial law. And this organization, which was an education-focused organization at the time, did have a DR site. However, not everything was replicated as expected, and they also didn’t plan on having power down for over ten days. That was just not something that was planned for. You don’t expect power to go out in a major city for ten days in some areas, and a lot of that was just a result of the instability of the city at the time.

So with that said, DR in the cloud has that benefit for when your resources are away from your primary site. Generally, you don’t have to go out and purchase new equipment. You may have to purchase additional software or agents in a lot of cases, but in general, the cloud has much lower overhead when it comes to disaster recovery than traditional DR approaches. You want to realize that generally, with traditional versus cloud, traditional is going to have a higher cost from a capex perspective because, again, you’re going to have to go out and purchase all this additional equipment. It could be a storage network, a compute network, or whatever additional software you need.

You also want to have those skill sets in house. You’ll also need to look at other costs for a DR site, and the testing could be difficult because, generally, a lot of organisations have a lot of their production and development instances on the same networks, and the last thing you want to do is anything that could cause production issues. So this could be a difficulty when it comes to testing. This is just one of the things that could come up with the cloud.

With the cloud, though, you may have lower upfront costs and minimal running opex. Now, in the cloud, you may need to go out and spin up additional VMs and additional storage. That could cost some additional money. You may have to transfer your data, and when I say transfer data, you may have to look at ingress and egress charges. Depending on what you’re doing, that could be a cost as well. You might not have the necessary skill sets, but the cloud provider generally takes care of the backend management, and testing is usually streamlined and automated in a lot of cases. Have a DR plan. Your organisation should have a DR plan. If it doesn’t, you should consider one, at least if you’re going to be an employee there for any length of time, because, again, things will go wrong, and when they do, they’ll most likely call on you. One thing I’ve noticed working in a variety of infrastructure roles is that DR is usually put on the back burner.

Once again, it needs to be part of your business continuity plan, your recovery plans, and your testing. You need to look at this not only from a single perspective but also holistically, because business continuity, backups, recovery, and testing all fit together. So there are a few terms you’ll need to understand, and this is where I believe the confusion arises, at least when I teach this course or other cloud courses. Recovery point objective (RPO): this is the amount of time that passes during a disruption before the amount of data that’s lost exceeds what’s allowed. That’s called tolerance. Basically, it’s how much data can be missed in a given amount of time, which is usually specified in minutes or hours. As a result, the recovery point objective is how much data can be lost. Is it 15 minutes? Is it 1 hour? If it’s much more than an hour, chances are that the application is not in production.

As a matter of fact, if it’s more than 15 minutes, it’s probably not a production app that’s that important. But, once again, I’m not here to pass judgement on what you do in your environments. I’m just sort of pointing out areas that you’re going to see on the exam. And RPO and RTO are two terms you have to know. These have been tested, in my opinion, at least a couple of times. With that said, I wouldn’t say it’s tested heavily, but I want to make sure you get these because if you don’t get them, you’re giving up some points on the exam.

Recovery time objective (RTO): this is the duration of time, usually defined within a service level as well, in which these processes have to kick in to ensure that the services are up and running. This is how long you can go without services. So this is the time. An easy way to remember RTO: this is the amount of time without services. So if you have an RTO of 1 hour, that means you can go 1 hour without having that service available. Whereas if you go back to RPO, the recovery point, this is the point in time you have to go back to.

For example, that might be 50 minutes. A lot of people get confused over this. Once again, I like to call the RPO “tolerance.” The RTO, on the other hand, is concerned with the services not being available. So just be aware; make sure you get the difference between the two, because this is where you can definitely miss some of what I would call “gimme questions” if you don’t know that. Some other terms you need to know are failover and failback. Now, failover is typically used with a heartbeat. It’s used in clustering, though not exclusively; it could be used in other cases. But generally, you want to know that if failover occurs, because the primary server is somehow not reachable, then what should happen is that the secondary server, let’s say, will be available and will be handling requests. “Failing back,” on the other hand, refers to the process of restoring something to its original state.

So failover is where generally the system says, “Hey, something’s not right here. Let’s go ahead and fail over to the secondary.” And then, when everything’s nice and wonderful again, let’s go back to the primary. So a DR plan should have redundancy and safeguards built into it. Another area I want to focus on is geoclustering. Now, geoclustering is commonly used to protect against regional disasters. On the exam, there was a question that sort of asked about how you could protect against a regional disaster. Now, I’m not saying this is exactly how the question is phrased or worded; I can’t say that, of course. But I can give you a hint: make sure you know what geoclustering is and are aware that it should also be part of a DR plan. When it comes to SLAs for DR, remember to understand who does what, when, how, and why. Make sure you get the KPIs. Be aware of what happens when there is a disaster.

What does that SLA essentially specify, right? When it comes to cloud-based DR, you should be aware of what your provider employs or is capable of. Be aware of how your provider is going to provide you with instructions, guidelines, and best practices. Amazon has robust documentation and, in my opinion, one of the better approaches when it comes to disaster recovery in the cloud. They cover almost every use case. They give you scripts; they give you everything from best practices to white papers. But it’s very specific too. So with that said, look at your cloud provider and look at what they’re providing you for DR. The process of restoring those processes back to their original state prior to the failure of the primary system is known as “failback.” So, failback, failover. Failover is essentially going from the primary to the secondary. In general, failing back occurs when transitioning from secondary to primary. Here’s a study tip: make sure you’re familiar with RPO and RTO. Once again, this was tested more than a few times.
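To make the failover and failback vocabulary concrete, here is a minimal sketch in Python. It is only an illustration: the class name, the three-missed-heartbeats threshold, and the rule that failback waits for a healthy heartbeat are assumptions for this example, not how any particular cluster product behaves.

```python
from dataclasses import dataclass


@dataclass
class FailoverPair:
    """Tracks which of two servers is handling requests (hypothetical model)."""
    missed_limit: int = 3      # assumed: missed heartbeats tolerated before failover
    active: str = "primary"
    missed: int = 0

    def heartbeat(self, primary_alive: bool) -> str:
        """Record one heartbeat check of the primary and return the active server."""
        if primary_alive:
            self.missed = 0
        else:
            self.missed += 1
            if self.active == "primary" and self.missed >= self.missed_limit:
                self.active = "secondary"   # failover: primary -> secondary
        return self.active

    def fail_back(self) -> str:
        """Return to the primary, but only once it is answering heartbeats again."""
        if self.active == "secondary" and self.missed == 0:
            self.active = "primary"         # failback: secondary -> primary
        return self.active
```

In this sketch, three consecutive failed heartbeats move traffic to the secondary, and calling `fail_back()` only returns to the primary after a successful heartbeat has been seen.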

  1. Disaster Recovery Testing

Disaster recovery testing. Once again, after you set up your DR plan, it’s important to, of course, test your DR plan. Some organisations are probably better at this than others, but for this exam, I just want to focus on the testable areas, which are basically the testable objectives. On the exam, the first thing is to make sure that you understand what a DR test is and why you want to run that DR test. The goal is to ensure that you validate KPIs, but you also validate your RPOs and RTOs as well.

For example, if you’re expecting an RTO of, say, an hour of downtime, no more, then you need to test for that and validate that as well. Again, that statement is fairly straightforward, but be aware, from a test-objective standpoint, of why you want to DR test and also of the different ways to test the DR plan. The first is through what is called simulation. This is where you essentially simulate an issue. For example, let’s say you have a server that has two paths. You want to validate that if one path goes down, the other path is going to take over the traffic. Well, how would you do that? You go ahead and essentially pull a cable, right? Go to the port, go to the network card, whatever you do, pull it out and see how that server handles the loss of that path. Very simple. That’s typically considered a simulation. A “cutover” is where you essentially take a service and move it over to another server. Parallel is where you run the tests in parallel with each other.

And then paper is more administrative, where you just document it; that’s essentially what that is. When it comes to best practices, again, just look at what the best practices are for that resource, for that provider. Like I said before, Amazon’s got a fairly lengthy amount of resources and best practices (SOPs, documentation, whatever) to help you define some of these areas. Cost is one of the areas that you need to understand, and this sort of sums it up: a lower RTO generally means a higher cost. So if you need, let’s say, ten minutes versus an hour, establishing a ten-minute RTO is going to be more expensive than an hour in most cases.

And it could be the difference in the sense that you have to go out and get different software packages. You may have to get a higher-performing link; you may have to look at distances. For example, with synchronous replication, you really need to be within 60 kilometers. Not exclusively, but as a general rule, 60 km is good. Some vendors say up to 100 km, which is 60 or so miles. It really depends on the vendor solution that you use with the cloud provider as well. So take a look at those areas. Understand why you want to do a DR test. I won’t read this to you again. Finally, make sure you know the various ways to test a DR plan.
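Since the goal of a DR test is to validate your RPOs and RTOs, here is a small sketch of what that check could look like after a test run. The function name, the timestamps, and the 15-minute and 1-hour targets are made up for illustration; a real test would take these values from your own logs and SLA.

```python
from datetime import datetime, timedelta


def dr_test_result(last_good_copy, outage_start, service_restored, rpo, rto):
    """Compare one DR test run against its RPO and RTO targets."""
    data_loss_window = outage_start - last_good_copy   # data at risk (RPO side)
    downtime = service_restored - outage_start         # time without service (RTO side)
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


# Hypothetical test run: last replicated copy at 9:50, outage at 10:00,
# service back at 10:45, measured against a 15-minute RPO and a 1-hour RTO.
result = dr_test_result(
    last_good_copy=datetime(2024, 1, 1, 9, 50),
    outage_start=datetime(2024, 1, 1, 10, 0),
    service_restored=datetime(2024, 1, 1, 10, 45),
    rpo=timedelta(minutes=15),
    rto=timedelta(hours=1),
)
```

Here the data loss window is 10 minutes and the downtime is 45 minutes, so both targets are met. Tightening the RTO to ten minutes would make the same run fail, which is the cost trade-off described above.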

  1. Disaster Recovery Techniques

Disaster recovery techniques. Let’s go ahead and talk about different approaches to handling your disaster recovery scenarios and your infrastructure architecture, whether or not it’s in the cloud: things you want to think about. One of the first things that you need to do is confirm your wide-area network and service provider requirements and look at your SLA as well. Validate latency, validate bandwidth, and enable compression if possible. When it comes to site mirroring, what you do is mirror your application to another site. Now, of course, this is probably going to be a cost factor for some organizations. If you’re going to mirror your production site to a secondary site, that’s essentially copying 100% of that application to another site. So that’s going to double your cost, at least in most cases. So use replication, and we’ll talk about the different types of replication coming up. We’ll look at ways to improve performance while also increasing availability. So site mirroring serves some good purposes.

For the exam, though, make sure you understand that site mirroring is used to create a copy of your application data. This is essentially good for performance and availability. There are two different types of replication: synchronous and asynchronous. Synchronous is essentially copying block by block, whereas asynchronous is going to be essentially delayed. Let’s go ahead and talk about some of the differences, starting with synchronous. This is a zero-data-loss approach. Typically, this is going to be a very small RPO, like under a minute or so. Furthermore, the write is not considered complete until both the local and remote storage acknowledge it.

It has the shortest RPO, but it is very expensive. With asynchronous replication, the write is essentially complete as soon as the local storage acknowledges it. Now, some vendors handle it somewhat differently, but overall, you’re going to have a journal volume in most cases, and the remote storage will be updated as soon as that journal reflects that activity. It has a longer RPO, is generally less expensive, and also puts a lower load on resources. But the main benefit is really the fact that you can also disperse your resources over a much larger geographic area. One of the things I don’t believe is mentioned here is that with synchronous replication, you’re really limited to approximately 100 km, maybe down to 50 or 60 km depending on the vendor.

So you need to look at it from that perspective. So, for example, if you’re in Washington, DC, you could replicate to a nearby site without a problem using synchronous replication. But if you’re going to try to replicate to Richmond, Virginia, then that may be a stretch. So you need to look at how that could be enabled if that’s what you’re looking for.

So with synchronous, you’re really limited by the geography. If there’s any kind of, I guess, disadvantage to synchronous replication, it’s that you’re definitely limited by geography. When it comes to asynchronous, you’re not limited by geography in most cases; you’re limited by the network performance. And that could depend on a lot more factors, such as the tolerance of the application. But in general, there are a couple of things to be aware of. Site mirroring: for the exam, make sure you know what site mirroring is. This is essentially you making a copy of your application data. On the exam, make sure you understand the difference between synchronous and asynchronous replication. For this exam, you’ll certainly see a question about replication. Again, understand the differences between the two types.
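The difference between the two replication types comes down to when a write counts as complete. The toy model below is a sketch of that idea only; the class, its method names, and the in-memory lists standing in for storage are all assumptions for illustration, and real arrays and vendors implement journaling differently.

```python
from collections import deque


class ReplicatedVolume:
    """Toy model of a local volume replicated to a remote one (illustrative only)."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.local = []
        self.remote = []
        self.journal = deque()   # writes waiting to be shipped (async only)

    def write(self, block: str) -> str:
        """Store a block and return which side(s) acknowledged before completion."""
        self.local.append(block)
        if self.synchronous:
            self.remote.append(block)        # remote must acknowledge too
            return "acked by local and remote"
        self.journal.append(block)           # remote is updated later
        return "acked by local only"

    def drain_journal(self) -> None:
        """Ship journaled writes to the remote site (async catch-up)."""
        while self.journal:
            self.remote.append(self.journal.popleft())

    def blocks_at_risk(self) -> int:
        """Writes that would be lost if the local site failed right now."""
        return len(self.local) - len(self.remote)
```

With `synchronous=True`, `blocks_at_risk()` is always zero, which is the zero-data-loss, shortest-RPO case. With `synchronous=False`, anything still sitting in the journal is the data you could lose, which is why the RPO is longer.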

  1. Business Continuity

Business Continuity Overview. Let’s go ahead and discuss areas around business continuity that you’ll see on the Cloud+ exam. On this specific exam, there’s going to be some terminology you’ll definitely want to know. The first term you should be familiar with is “business impact analysis.” This is a systematic process to determine and evaluate the potential effects of an interruption to critical business operations as a result of a disaster, accident, emergency, et cetera.

The goal is to be able to prioritize applications and provide a sequence as well. Let’s say, for example, you live in Washington, DC. That area has what’s known as a “blast zone.” Let’s say, for example, that there is a major event in Washington, DC. The reality is that, with shared power grids, a power grid failure will most likely take out the majority of the DC area. With that in mind, you need to look at how you can get your applications up and running in another area, hopefully one that is not connected to the same power grid. With that said, you need to ask: do I bring up my email first? Do I bring up my CRM first? You need to provide a sequence as well. So what is business continuity, again? Business continuity is really focused mainly on preparation to ensure that your organisation is going to keep running in case there’s a serious incident or disaster.

Also, you must be able to recover to a functional state in a reasonable amount of time. Now, the business impact analysis is used to gather these requirements for business continuity. Generally, from my experience, a business continuity expert is usually very good at disaster recovery as well. Generally, these areas go together, and performing a BIA is a routine process for people that are good at this area. Now, for example, if you’re a virtual machine administrator or storage administrator, you may want to look at your own specific applications and determine areas that you could improve to help assess a failover issue or other areas that could impact your company’s business. When it comes to ensuring the continuity of operations, this is known as “COOP.” Continuity is essentially COOP’s goal, and there are a couple of ways to look at it.

For example, a lot of my experience came in the government sector for quite a while, and then I got into the commercial sector as well. One of the things I can tell you is that the government looks at this very differently than typical businesses. With that said, when it comes to continuity of operations for this exam, you want to be able to understand that COOP really is focused on making sure you can get back to normal operations after a disaster. This is part of what’s called a contingency plan as well. In this case, you have to respond to something.

You look at what’s called a business impact analysis. And then from there, you go ahead and take a look at how those operations would continue in the event of a disaster. Some other terms you’re going to want to be aware of for this exam revolve essentially around different types of disaster recovery and business continuity sites. So we have a hot site, a warm site, and a cold site. Let’s go ahead and discuss them. Hot sites are essentially mirrors of your existing data center. A hot site’s goal is to keep production running in one or more locations. Sync and replication are essentially well maintained and expected. But as you would expect, this is by far the most expensive site configuration.

When it comes to a warm site, these are generally used in situations where you need to be up within a reasonable amount of time, let’s say 24 hours, 48 hours, or less. This is where you have, to a degree, the site ready. You may need to do some recovery, and you may also have equipment there, but it may not be fully ready. What I’ve seen is that a warm site is often confused with a cold site. And I’ll talk about what a “cold site” is. Now the reality is, if you’re not able to get a warm site up in under two days, then it’s really a cold site, and I just don’t see how a company would survive past two or three days of not having operations. But that’s just my opinion. Now, for the exam, you also want to know what a cold site is. This is generally an empty room or an empty building. There will be utilities, but there’s generally no equipment there. There are no desks, and there are no racks of servers. There may or may not even be any telecom. Now, a cold site is pretty cold in the sense that there’s really no activity going on there. In many cases, it will take months to get a site up and running if it lacks telecom or any kind of power.

So cold sites are not really effective at mitigating. The only time a cold site would be useful is if your production site was completely destroyed, such as if you were in downtown New Orleans during the floods. Again, that’s a situation where a cold site could help, but it’s not going to be up and available for a very lengthy amount of time. Anyone who’s tried to install telecom or get links installed knows that this is not a quick process. Business impact analysis: make sure you know what that is for the exam. Make sure you understand that it’s a systematic process to determine and evaluate the potential effects of an interruption to critical business operations, generally as a result of a disaster, accident, or emergency. Here’s a test tip: make sure you understand the distinctions between the various types of business continuity and disaster recovery sites.

Remember, a hot site is a site mirror. Generally, a warm site is going to entail an empty room, generally one that has utilities and telecom pretty much ready to go, but the equipment just isn’t ready to go. Whereas a “cold site” is an empty room with no utilities, no telecom, no anything. So for this exam, one of the things I did want to point out is they’ll ask you, usually based on more of a scenario question, if anything like customer A is looking to do this, what kind of site would you like to recommend to them? And you’ll need to understand when to specify a hot site, a warm site, or a cold site. That’s essentially what I’d like you to know for the exam.
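For the scenario-style questions, it can help to think of the site choice as a function of how long the customer can be down. The sketch below is only a study aid: the 48-hour warm-site ceiling comes from the example above, while the 1-hour cutoff for a hot site is an assumption picked for illustration, and real recommendations also weigh cost.

```python
from datetime import timedelta


def recommend_site(
    rto: timedelta,
    hot_max: timedelta = timedelta(hours=1),    # assumed cutoff, not an exam value
    warm_max: timedelta = timedelta(hours=48),  # from the lecture's 24-48 hour example
) -> str:
    """Map a required recovery time to a hot, warm, or cold site."""
    if rto <= hot_max:
        return "hot"    # mirrored site, production keeps running
    if rto <= warm_max:
        return "warm"   # site mostly ready, some recovery needed
    return "cold"       # empty space, long lead time
```

For example, a customer who can tolerate ten minutes of downtime maps to a hot site, one who can wait a day maps to a warm site, and one who can wait a month maps to a cold site.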

  1. GCP Resources Demo

Welcome back. Let’s go ahead and talk about resources on the Google Cloud platform, specifically managing resources on the Google Cloud platform. There are a lot of things to really understand and correlate for the exam around resources. And so some of the questions you’re going to get will be mainly focused on what resources go where and whether you can migrate this resource from this zone to that zone or from this region to that region.

So let’s discuss the three types of resources in the Google Cloud Platform. The first is global, the second is regional, and the third is zone or zonal. So here’s a good diagram that shows you how the hierarchy is in GCP. So if you note over here, you have global resources, and under the global resources are regional resources, and then you have zonal resources. So if you go back to the previous few modules, you’ll see the GCP networks and regions demo I did that showed you how that’s all lined up. So do take another look at that and make sure you understand it.

Now some of the questions on the exam will ask you if you can move resources from this region to this region, let’s say. And again, it all depends on the type of resource. So you can’t move a disk from one zone to another, but an image is different. An image is a global resource. So you need to understand the types of resources, not just images but resources like networking and everything like that as well. And let me turn off my email. Now, in terms of resources, let’s go ahead and confirm a few things. So you see here that you have global resources, so images, snapshots, and networks are all global resources.

And then for regional resources, you have something like an external IP address. We’ll talk about external and internal IPs because they’re very different and there’s a lot of confusion over that. So, for example, an external IP address has to be correlated to an internal IP address, and we’ll talk more about that in the networking sections. And then zonal resources would be instances and disks. And you can see that it’s all correlated to a project, so you have what’s called the physical organisation and the logical organization. So let’s go over and talk about what a global resource is. Global resources are accessible by any resource in any zone within the same project. Remember that when you create a global resource, you don’t need to provide a scope specification. So, again, you create an image, or if it’s a Google predefined image, that can be used in another region or zone. The same goes for snapshots, so you can recover a snapshot from anywhere. And then there is also the virtual private cloud network, which is a global resource as well. But again, subnets are regional. Now on the test, rest assured, you will see a question that is going to ask you about moving virtual machines, moving networks, and a few other resources, and you need to know what those are and whether you can move a subnet from one region to another. And the answer is no. Again, you could recreate it, but you can’t just move it from one region to another, unlike an image or something of that nature. Firewall rules apply to the network, but they’re still considered a global resource. Then routes and global operations. I won’t read all these to you.

Again, just take a few minutes of your time before you take the test to make sure that you understand the differences between resources. It’s very important. Now, a regional resource is a resource that is in a specific region. So, in the Americas, for example, if you have a regional resource, it stays in the Americas. You can’t simply relocate it to Asia-Pacific, for example. So what are some examples of regional resources? Again, you’ve got addresses. So if you have an external IP address, that’s a static IP. In GCP, for example, if you need to route outside of your GCP instances, you must obtain an external IP address. And that IP is essentially considered a regional resource. Then subnets, again, and regional operations. And then there is the zone resource. So, for example, if you’ve got something hosted in Iowa, you just can’t move it to another zone, even if it’s in the same region. And what are some of these resources? Again, an example would be a virtual machine instance or a disk. That disk is essentially a physical resource that you’re reusing in a lot of cases, and you just can’t attach it to another instance in another zone. Machine types are zone resources too. And lastly, again, operations.
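One way to memorise the scopes is as a simple lookup table. The sketch below encodes only the resource types named in this section; the dictionary keys, the function name, and the `(region, zone)` tuples are illustrative choices, not GCP API identifiers.

```python
# Scope of each resource type as listed in this section (names are illustrative).
RESOURCE_SCOPE = {
    "image": "global",
    "snapshot": "global",
    "network": "global",
    "firewall_rule": "global",
    "route": "global",
    "external_ip_address": "regional",
    "subnet": "regional",
    "instance": "zonal",
    "disk": "zonal",
    "machine_type": "zonal",
}


def is_reachable(resource_type, resource_location, consumer_location):
    """Can a consumer at (region, zone) use a resource living at (region, zone)?"""
    scope = RESOURCE_SCOPE[resource_type]
    same_region = resource_location[0] == consumer_location[0]
    same_zone = resource_location == consumer_location
    if scope == "global":
        return True             # usable from any zone in the project
    if scope == "regional":
        return same_region      # usable from any zone in the same region
    return same_zone            # zonal: same zone only
```

So an image created alongside resources in `("us-central1", "us-central1-a")` is reachable from `("asia-east1", "asia-east1-b")`, while a disk in `us-central1-a` is not reachable from `us-central1-b` even though both zones are in the same region.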

So let’s go ahead and talk about quotas. Now, quotas are important to understand in the sense that a quota is used to protect not only you but also other customers as well as Google. The goal of a quota is to make sure that you don’t have a rogue user, or just someone not paying attention, using up more resources than what is really needed. In a lot of cases, this should prevent runaway consumption and billing spikes and therefore enforce sizing considerations. So, how do you check your quota? You go over to the console, or you use the gcloud command line as well. You do not need to know the commands for the test; this is only FYI. Next, labels. Now, a label is another area where there was some confusion at first because labels and tags were, as I recall, not well thought out, which is a good way to look at it because, once again, many people treat a label and a tag as the same thing. In reality, they are actually two different things.

But again, initially, when you created a tag, it created a label. And then, if you created a label, a tag was created as well. And so there’s still a lot of confusion over what a label is. So what is a label? It’s a utility for organising your GCP resources. So, for example, you want to attach it to, like, a VM, a disk, a snapshot, et cetera. And the reason is that you want to be able to find something quickly. You want to be able to define billing. You want to be able to see who’s doing what. So again, that’s essentially what a label is for. So let’s go over to the console here and talk more about the labels and resources in GCP.
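As a rough illustration of why labels help with finding things and with billing, here is a sketch that groups costs by a label key. The resource names, label keys, and cost figures are invented for the example, and the records are plain Python dictionaries rather than anything returned by a GCP API.

```python
from collections import defaultdict

# Hypothetical inventory: each resource carries key/value labels and a cost.
resources = [
    {"name": "vm-web-1", "labels": {"team": "web", "env": "prod"}, "monthly_cost": 40.0},
    {"name": "vm-web-2", "labels": {"team": "web", "env": "dev"}, "monthly_cost": 10.0},
    {"name": "disk-db-1", "labels": {"team": "data", "env": "prod"}, "monthly_cost": 25.0},
    {"name": "snap-old", "labels": {}, "monthly_cost": 5.0},
]


def cost_by_label(items, key):
    """Total monthly cost per value of one label key; unlabeled items are grouped."""
    totals = defaultdict(float)
    for item in items:
        totals[item["labels"].get(key, "unlabeled")] += item["monthly_cost"]
    return dict(totals)


def find_by_label(items, key, value):
    """Names of resources whose label `key` equals `value`."""
    return [item["name"] for item in items if item["labels"].get(key) == value]
```

Calling `cost_by_label(resources, "team")` splits the bill by team, and `find_by_label(resources, "env", "prod")` answers the "find something quickly" question for production resources.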
