Google Professional Data Engineer – BigQuery ~ Hive ~ OLAP Part 2
At the end of this lecture, try and answer this question: how would you access nested JSON fields when data with this kind of nested JSON is loaded into BigQuery? In this lecture, we'll see how we can load nested JSON data into BigQuery with or without specifying a schema. We'll start off in the BigQuery web console. Let's create a brand new data set within our test project. We'll call the data set person_data and locate it in the US.
Click on OK and you'll find one more data set on the left: below baby names, we now have person_data. We want to bulk load data into a table under this person_data data set; instead of CSV, though, we want the data to be specified in JSON. I'm going to use a sample data set that Google makes available to us at this URL: cloud.google.com/bigquery/docs/personsData.json. I've downloaded this file onto my local machine, and now when I switch over to BigQuery, I'm going to use the UI to upload this JSON file. This JSON file comes along with a schema, which is also specified in a JSON file.
For now, though, I'm only going to focus on personsData.json, the second file in the list that you see on screen. Once you've chosen this file, specify in the UI that it's a JSON file which is newline delimited; that means different JSON entities are separated by a newline. Next, we specify the destination table into which we want to load this data.
I'm going to call it person_data_auto, because I want the schema to be automatically detected. So click on the checkbox which says to automatically detect the schema for this JSON data. Go ahead and hit the Create button in order to create the table and bulk load data into it. If you navigate to Job History from your left navigation pane, you'll find that person_data_auto is in the loading phase.
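For reference, here is a rough command-line equivalent of the load we just kicked off (the bq tool is covered in a later lecture). The dataset, table, and local file names here are assumptions based on this demo, so adjust them to match your own setup.

    # Load newline-delimited JSON and let BigQuery detect the schema.
    bq load \
      --source_format=NEWLINE_DELIMITED_JSON \
      --autodetect \
      person_data.person_data_auto \
      ./personsData.json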
You can see that in the Job History view: all the jobs that we've run so far on BigQuery are listed here. We uploaded a file into the baby names table earlier, and now we've done the same for person_data. This persons JSON file had nested fields within it. Let's see how those fields were treated by automatic schema detection.
For each person, there was a field called citiesLived, indicating the cities where that person lived for certain periods of time. A person may have moved several times in his or her lifetime, which is why there can be multiple cities in which a person has lived. For each city, the JSON file contains information about how many years that person spent in that place. The citiesLived column in BigQuery is a repeated record; essentially, you can imagine it as an array of structs.
Within each struct, there is the information about the years lived and the place at which that person lived. Gender is a standalone field; it's not nested, so a simple string data type is what BigQuery detected. The children that a person has is once again a nested field: a person can have any number of children, and we have information about each child's age, gender, and name. Notice how individual fields within a nested field can be accessed with the dot operator: the number within the phoneNumber field is phoneNumber.number, and phoneNumber.areaCode gives you the area code. A person is assumed to have just one phone number.
So phoneNumber is not a repeated field. If you click on the Preview tab, you can see how these nested fields look within BigQuery. For citiesLived, you can access the individual fields using the dot operator: citiesLived.yearsLived and citiesLived.place. Because citiesLived is a repeated record, or an array of structs (that's the best way to imagine it), you also need an index to access one of the structs within the array. Remember that when we loaded this table, we asked that the schema be automatically detected. It's possible that BigQuery didn't do a good job of auto-detecting your schema; notice the message you get in the yellow bar at the top.
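As a sketch of how these nested fields can be queried, assuming the dataset and table names used above and the camelCase field names from Google's published sample schema (fullName, phoneNumber, citiesLived): in standard SQL, the non-repeated phoneNumber record is accessed directly with the dot operator, while the repeated citiesLived record is expanded with UNNEST.

    bq query --use_legacy_sql=false '
    SELECT
      fullName,
      phoneNumber.areaCode,          -- dot operator on a non-repeated nested field
      city.place,
      city.yearsLived
    FROM person_data.person_data_auto,
      UNNEST(citiesLived) AS city    -- each repeated struct becomes its own row
    LIMIT 10'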
You can rerun the same job and specify your own schema for this particular file. This is the option you might want to choose if you feel that the automatic schema detection is not up to your standards or not the way you like it. Clicking on that link will allow you to edit the schema in place. Let’s go back to our Create table UI and see some other interesting stuff within it. Notice there is an option to run the same job that you just ran. If you repeatedly want to load data into a particular table, you’ll simply hit Select Previous Job and run the same job again. If you click on this button, it will give you an option as to which previous job you want to rerun.
Do you want to reload the baby names table, or do you want to reload your person data table? I'm going to rerun the job which loads the person data table, but once I have my UI prefilled for me, I'm going to make a few changes. I have to choose the file from which I want to load data. Once again, I choose personsData.json as I did before, but this time I do not want the schema to be automatically detected. I have a schema file which I want to use to specify to BigQuery how the data should be laid out. Click Edit as Text on the bottom right and you'll see a text box which allows you to specify the schema. We've used this text box before, but this time the schema is going to be specified in JSON. Google has published the corresponding JSON schema for this persons data.
You can access it at the URL that you see on screen: cloud.google.com/bigquery/docs/personsDataSchema.json. Notice the format of this JSON schema: for every field we have the name of the field, the mode of the field (whether it's nullable or not), and also the data type for that field. Copy and paste this schema into the text box where your schema can be specified on the web console. I'm going to set up a new destination table for this upload.
I'm going to call it person_data_schema. We are now ready to create and load this table, and once the job has finished running, you can check out the table details. The schema looks exactly the same as it did when we auto-detected it, which means BigQuery did a pretty good job. If you look at the Preview pane, though, you'll notice it's arranged a little differently: the order of the columns is different, but everything else is the same. This lecture should have shown you that the way to access nested fields in BigQuery is by using the dot operator.
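Before moving on, here is a sketch of the equivalent schema-driven load from the command line. The file and table names are assumptions based on this demo; the key difference from the earlier load is that a schema file is passed as the last argument instead of using --autodetect.

    # personsDataSchema.json is the JSON schema file downloaded from the docs page.
    bq load \
      --source_format=NEWLINE_DELIMITED_JSON \
      person_data.person_data_schema \
      ./personsData.json \
      ./personsDataSchema.json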
At the end of this lecture, you should be able to answer this pretty important question: how can you reduce the amount of data processed by BigQuery? We've seen earlier that you get charged based on how much data BigQuery processes, so reducing this is definitely a priority, especially for huge data sets. In this video, we'll see how we can access many interesting public data sets on BigQuery. Google has made a number of very interesting public data sets available on BigQuery for free. The data is stored for free on public, shared infrastructure; you only pay for the queries that you run on these data sets. You can see a list of the public data sets available on BigQuery at this URL: cloud.google.com/bigquery/public-data. These public data sets are great for data science enthusiasts who want to run their models on real-world data. For example, here is a data set which covers Bay Area bike share trips.
BigQuery also has data from GitHub: over 2.8 million open source repositories, along with commits and a whole bunch of other contributor information, all available to you to be queried. There's also weather data from the National Oceanic and Atmospheric Administration that we'll be using in one or more of our labs. And here is a data set that I've used really often: the NYPD Motor Vehicle Collisions data, covering all collisions recorded since 2012 across the five New York City boroughs. You can see what public data sets are available on the BigQuery console as well. Just go to bigquery.cloud.google.com and click on public data sets in your left navigation pane. If you expand the public data sets, you'll see a whole bunch of them available right there.
Remember that these are stored in public storage on the Google Cloud Platform: you don't pay for storage, only for the queries that you run. Let's explore a few tables within these public data sets and see how we can run queries on them. The one I'm going to choose is in bigquery-public-data (that's the name of the project), colon, samples (that's the data set), and within that there are a whole number of tables. Notice there are tables for GitHub, gsod (which is the weather data), natality, shakespeare, trigrams, wikipedia, et cetera.
I'm going to choose one at random. Let's look at the natality data, which covers baby births in the US over a certain time period. You can check out the column details for this particular table, just for fun: there is the year in which the baby was born, the month, the day, the state, the race, the weight in pounds, and so on. We'll run our first query on this table. We want to find the weight in pounds, the state, the year, and the gestation weeks for the babies in this particular table; we want to order by the weight in pounds, and we just want ten results.
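Here, for reference, is roughly what that query looks like, shown as a bq command-line invocation in standard SQL (the console demo may use the legacy dialect; the sort direction is a guess, since the lecture just says to order by weight):

    bq query --use_legacy_sql=false '
    SELECT weight_pounds, state, year, gestation_weeks
    FROM `bigquery-public-data.samples.natality`
    ORDER BY weight_pounds DESC
    LIMIT 10'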
Notice how you access the specific table in the public data sets: bigquery-public-data, then samples, then natality. You specify the query exactly as if you had created the data set and populated the table on your own. Simply run this query and you'll get the results. It's pretty awesome to have all these cool data sets right there, ready-made for you to work with. Let's explore another table from the left navigation pane. This time we'll look at Shakespeare's works. Click on the table in the left nav pane and view the details. This table contains every word that has occurred in Shakespeare's works, the number of times that word has occurred, the corpus or work from which the word was extracted,
and the date the work was published. Let's run a query on this table. We want to find the total number of words for each of Shakespeare's works, so we sum all the word counts for each corpus. It's a simple aggregation SQL query with a group-by clause, and we access the bigquery-public-data samples shakespeare table. The query parser will validate whether the syntax of this query is correct; when it has finished validating, it will show you a check mark on the bottom right for a valid query, or an exclamation mark for an invalid one. If you look at the results, you can see that the sonnets contain the maximum number of words and, amongst his longer works, one of the King Henry plays has the most words.
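A sketch of that aggregation query, again in standard SQL with the public table's documented column names (word_count, corpus):

    bq query --use_legacy_sql=false '
    SELECT corpus, SUM(word_count) AS total_words
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY corpus
    ORDER BY total_words DESC'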
Some lectures ago, when we started with BigQuery, we worked with the baby names data set that we'd created, with the baby names table that we'd uploaded using some CSV data. This data is actually freely available in a public data set on BigQuery: you can go to bigquery-public-data, usa_names. There's a table there, usa_1910_current, which gives you all the baby names from 1910 up to the current year. This data set has the state in which the baby was born, the gender, the year of birth, the name, and the number of occurrences of that name. Let's run a query to find the most popular baby names across genders from 1910 till today; we'll limit the results to just ten. These are both boy and girl names combined, to find the most popular across all these years put together. You'll find that James is the most popular name; John comes next, then Robert, Michael, and then William. You have to scroll to the next page in order to see the most popular girl's name. Let's see if you can guess it. It's Mary.
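A rough sketch of that query, using the public table's documented columns (name, number):

    bq query --use_legacy_sql=false '
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_current`
    GROUP BY name
    ORDER BY total DESC   -- change DESC to ASC for the least popular names
    LIMIT 10'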
You can simply change this query a little bit: make the descending order ascending, and you'll get the least popular baby names across genders from 1910 till today. These least popular names are also the ones that are more unusual or rare; some examples are Jamari, World, Carl, Romana, Nazir, and so on. Now that we are a little comfortable with the query, let's look at what affects the amount of data that is processed when you execute a query. If you look at the select query, we've selected three columns (number, name, and gender), and this query processes 105 megabytes of data when run. BigQuery's internal storage is a columnar format, so columns of data are fetched together; if you reduce the number of columns that you select, you'll reduce the amount of data processed.
The data processed does not depend on the where clause that you specify; instead, it depends on the columns that you want to project. Here I've removed one column, and the data processed has immediately gone down to 88.4 megabytes. Let's change this query once again.
We'll add a bunch more columns. Now we are selecting four columns, and the amount of data processed has gone up to 150 megabytes. This columnar format of BigQuery essentially implies that the entire table is not scanned when queries are run; only those columns that you specify in the select clause are scanned. This question should now be super easy for you to answer: if you want to reduce the amount of data processed by BigQuery, simply specify fewer columns in your select clause. Only choose those columns that you're really interested in; don't just add columns, because that will just increase your processing costs.
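A quick way to check this without paying for a query is a dry run, which reports the bytes that would be processed; here is a sketch using the bq command-line tool covered in the next lecture:

    # The dry run validates the query and prints how many bytes it would process,
    # without actually running it or incurring query charges.
    bq query --dry_run --use_legacy_sql=false '
    SELECT name, gender, number
    FROM `bigquery-public-data.usa_names.usa_1910_current`'

    # Dropping a column from the SELECT list reduces the bytes reported.
    bq query --dry_run --use_legacy_sql=false '
    SELECT name, number
    FROM `bigquery-public-data.usa_names.usa_1910_current`'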
At the end of this lecture, you should know how you can load data from Cloud Storage into BigQuery, and store data from BigQuery into Cloud Storage as well. In this lecture, we'll study how you can access BigQuery using the command line. You can run the bq command on the command line in order to perform BigQuery manipulations. bq help will show you all the manipulations that you can perform on the command line: you can say bq shell, bq show, and all of these commands have their corresponding descriptions right here. If you want more help on a particular command, type bq help followed by that command. For example, bq help mk will give you more help on creating a data set, table, or view with a particular name. If you run bq help show, you'll get the description which says that show gives you more details about a particular object, whether that object is a data set, a table within a data set, or a view. The bq ls command lists the data sets we have in this particular project; so far, we've just created the baby names and the person data sets.
If you want to run BigQuery commands in an interactive mode, you can simply type bq shell and you'll get into the BigQuery shell right here on the command line. Notice how the prompt is different: we are now in the BigQuery shell for my test project. You can run a host of BigQuery commands here in interactive mode, and any commands that you specify need not be prefixed by the bq keyword; just leave that out. You can simply say mk babynames to create a new data set called babynames.
Simply typing show in here will show you help for this command. show followed by the name of a data set, for example person_data, will give you a bunch of information about that data set: you can see when the data set was last modified, the permissions associated with it, and whether any labels have been applied to it. Simply type exit to get out of this interactive shell and back to your command line.
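A short sketch of the commands discussed above, assuming the dataset names used in this course:

    bq help               # list all available bq commands
    bq help mk            # detailed help for creating datasets, tables, and views
    bq ls                 # list the datasets in the current project
    bq show person_data   # dataset details: last modified, ACLs, labels
    bq shell              # interactive mode; inside it, commands drop the bq prefix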
The bq command on Cloud Shell did all the things that you would expect it to. But how did it know which project to run these bq commands on? This is specified in the ~/.bigqueryrc file. This is the configuration file associated with BigQuery, and within this file I've set the current project ID to my test project. You can set the project ID in this file very easily by typing the command that you see on screen: echo the line project_id = $DEVSHELL_PROJECT_ID and pass it into the .bigqueryrc file.
This DEVSHELL_PROJECT_ID is a very handy environment variable that's available on Cloud Shell and holds the ID of your current project. You'll find that we need it really often; instead of typing out the ID of your project, simply reference this environment variable.
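A sketch of that command; whether you overwrite or append to the file is up to you, this version appends:

    # On Cloud Shell, $DEVSHELL_PROJECT_ID holds the ID of the currently selected project.
    echo "project_id = $DEVSHELL_PROJECT_ID" >> ~/.bigqueryrc

    # Confirm that bq now points at the right project.
    cat ~/.bigqueryrc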
We can run these BigQuery commands on public data sets as well, using the command line. For example, you can run the bq show command on the samples.shakespeare table in the public data to get information about that table. Notice how the table is specified: project ID, colon, data set, dot, table. This will show you all the information that was available in the Details tab of the web UI: when this table was last modified, its schema, the number of rows it has, the number of bytes it occupies, and so on.
The bq head command can be used to preview some rows in the table: bq head with -n 10 will give you ten rows from the Shakespeare table. Most of what you have in the web console, you can do using the command line.
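A sketch of those two commands against the public table:

    bq show bigquery-public-data:samples.shakespeare        # schema, row count, size, last modified
    bq head -n 10 bigquery-public-data:samples.shakespeare  # preview the first ten rows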
Let's now upload some data into Cloud Storage and, from Cloud Storage, load it into BigQuery. In the bucket that I've set up earlier, the loony-us-bucket, I have a baby names folder which contains all the baby names files that I downloaded earlier from the Social Security website. Let's use the command line to load data into a BigQuery table from Cloud Storage. This we do using the bq load command. The bq load command requires that you specify the format of your source files; in our case it is CSV. We want to load the CSV data into our babynames data set, into a table called names_2011. The source data is located in our bucket, so give the full path to it: the loony-us-bucket, the baby names folder, and the yob2011.txt file. The last parameter on the command line is the schema for this table. Just execute this command to load data from Cloud Storage into your BigQuery table on the command line. You can use wildcards to load data from multiple files in one go. Here we want to create a table called all_names which has the data for all the years. Notice the wildcard there, yob20*, which means all the data starting from the year 2000 up to the present will be loaded into this all_names table. Execute these commands (sketched below) and then we can check the details of the tables that we just created.
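A sketch of those two loads; the bucket name, folder layout, and inline schema are assumptions based on this demo:

    # Load a single year's file, supplying the schema inline as the last argument.
    bq load --source_format=CSV \
      babynames.names_2011 \
      gs://loony-us-bucket/babynames/yob2011.txt \
      name:string,gender:string,count:integer

    # Load every file matching the wildcard into one table in a single command.
    bq load --source_format=CSV \
      babynames.all_names \
      gs://loony-us-bucket/babynames/yob20*.txt \
      name:string,gender:string,count:integer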
Just run bq show on our names_2011 table and you'll find from the metadata that the table has been loaded just fine. Switch over to the BigQuery web console and you'll find that all_names and names_2011 are present in your baby names data set. Explore the all_names table and you'll find that it has a total of over 200,000 rows and a table size of over four MB. You can run queries on the command line as well, simply by using bq query. This query finds the most popular girl's name over the last few years; it queries the all_names table that we just set up.
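A hedged sketch of what such a query invocation might look like; the column names match the inline schema assumed in the load above, and the exact query in the lecture may differ:

    bq query --use_legacy_sql=false '
    SELECT name, `count`
    FROM babynames.all_names
    WHERE gender = "F"
    ORDER BY `count` DESC
    LIMIT 10'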
Isabella was the most popular girl's name for one of these years; Sophia was number one for multiple years running, and Emma was popular, too. You can also export data from BigQuery into a CSV file on Cloud Storage by using the bq extract command.
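A sketch of such an export; the bucket path and file name are assumptions:

    # Export the all_names table to a CSV file in the Cloud Storage bucket.
    bq extract babynames.all_names gs://loony-us-bucket/export/all_names.csv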
Specify the table you want to extract the data from, and the bucket where you want to store it; give a full path, including the name of the file that you want to extract to. Once the export is complete, you can switch over to the web console, look into the bucket, and confirm that the export has completed successfully. There you see it: all_names.csv. The answer to this lecture's question is now very straightforward: you would load data from Cloud Storage into BigQuery using the bq load command on the command line.
When you use standard SQL in BigQuery, you can specify wildcards in table names, potentially querying multiple tables in one go. There are some special syntax constraints for this, though. What are they? That's what you'll learn in this lecture. Let's go ahead and study aggregate and Boolean functions in BigQuery. We are going to work on some interesting airline data that's available as part of the BigQuery public data sets. However, I've noticed that in some cases this data set is not readily available in the left navigation pane of the BigQuery web console. In order to load these public data sets, just go to this URL and click on the BigQuery samples link. This link actually opens up your BigQuery web console, and in the left navigation pane you might find some new sample public data sets that didn't exist before.
Maybe you already had access to those earlier; it doesn't matter. For those who didn't, they should now be loaded. This data is part of the bigquery-samples project. Within this project, we'll work with a data set called airline_ontime_data, and within that there is a flights table. This flights table contains a huge number of flights over a certain period of years: which airport they departed from, which airport they landed at, and whether they were on time or not. You can spend some time looking at the schema for this particular table; it's very interesting and it lends itself to all kinds of useful analysis. Let's take a look at the columns here: the date on which the flight occurred, the airline, the airline code, the departure airport, its state, latitude and longitude, the arrival airport, and so on.
In this demo, we'll try to build up a really complex query which allows us to figure out the fraction of flights delayed per airport due to rain. We'll focus on LaGuardia, but you can carry out this analysis for any airport. The columns of interest to us in this query are the departure delay column that's highlighted and, below that, the arrival delay as well; we'll be using both of these. Go ahead and hit the Compose Query button, and let's set up our very first query to sample what this data actually looks like. You need to get a feel for the data before you start running complex aggregations on it. This query selects the airline, date, and departure delay from the table, and the condition is that the departure delay should be greater than zero.
That is, the flight was delayed when it was leaving LaGuardia, and we want just 100 results. Notice, though, that the validator on the bottom right indicates that this query is not valid; there's something wrong here. If you click on the validator, it will give you a hint which says Invalid Table Name. But if you look on the left, this table name is perfectly valid: within bigquery-samples, we have the airline_ontime_data data set, and within that the flights table. So what is going on here? Why is this table name not valid? The answer lies in how this table has been specified. Notice that we've used backticks to surround the table name. Backticks in standard SQL allow us to specify wildcards within the table name and query multiple tables in one go. These backticks are, however, not valid in legacy SQL, which means we need to change the SQL dialect used to run this query. Click on the Show Options button and uncheck the Use Legacy SQL checkbox.
This will end up using the standard SQL dialect, which is what we want. When you do this, you'll find that the validator immediately turns green, indicating your query is now valid. Now that our query is set up correctly, let's see what information we are going to get: the airline, date, and departure delay for all delayed flights from LaGuardia, across the entire data set. We are only going to sample 100 of these records, though.
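Roughly what this standard SQL query looks like; the departure airport code for LaGuardia is assumed here to be "LGA":

    bq query --use_legacy_sql=false '
    SELECT airline, date, departure_delay
    FROM `bigquery-samples.airline_ontime_data.flights`   -- backticks: standard SQL only
    WHERE departure_delay > 0
      AND departure_airport = "LGA"
    LIMIT 100'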
Run this query and you'll get the results at the very bottom. Examine the results and see if you can understand how the data is laid out. Notice that for the airline 9E, across multiple dates, we get what the departure delay for each flight was. The result is, for each individual flight that took off from LaGuardia, how many minutes it was delayed. This information is more interesting if we can perform some aggregations on it. Let's remove the limit of 100, group this information by airline, and also order it by airline. We'll find the number of delayed flights by airline, still limited to LaGuardia Airport, so we'll see how different airlines perform at LaGuardia. That can help us choose which airline we want when we make our trip.
Groupings are generally accompanied by aggregations. Let's count the number of flights that were delayed, so we have the airline and the number of that airline's flights that were delayed when taking off from LaGuardia. Let's get these results in some order so it's not haphazard; we'll just use descending order of airline. Notice that when we reduce the number of columns that we select, we've also reduced the size of the data set that we're going to process: the earlier size was around 1.9 GB, and now it's just 1.12 GB.
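A sketch of that grouped query:

    bq query --use_legacy_sql=false '
    SELECT airline, COUNT(departure_delay) AS num_delayed
    FROM `bigquery-samples.airline_ontime_data.flights`
    WHERE departure_airport = "LGA"
      AND departure_delay > 0
    GROUP BY airline
    ORDER BY airline DESC'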
We've eliminated the date column from our query. You can see the results at the bottom: the total number of flights for each airline that were delayed when they took off from LaGuardia. This is across multiple dates; that's why the numbers are so huge. Play around with the query parameters a little bit and make sure that you really understand this query; we are going to be using it as part of a larger query later on in this lab. The end objective is to find the fraction of flights delayed on rainy days on a per-airline basis. Let's filter this data set by date, because that's also possible, and see how the date is formatted for this particular data set. Here are the results at the bottom: for 13 May 2008, 9E has four flights delayed, while American Airlines has 20 delayed flights. Change the date to a few different values and see how the data looks. At this point, we have a count of how many flights were delayed on a single day on a per-airline basis, all for LaGuardia. Let's see if we can make this query a little more interesting.
In a single query, let's try to calculate two different things: the total number of flights that took off from LaGuardia on a single day, and the number of those flights that were delayed. Pause and think about what we are really trying to do here. We want to run two different aggregations on the same departure delay column: we want to count the total number of flights as well as those which were delayed. This is a sample query which allows you to do exactly that. Let's analyze this query by focusing on the parts which we've seen before. We want to query the flights table, and we alias this table as f so we can reference it easily. The condition is that the departure airport is LaGuardia and the date is 13 May 2008. We want to group this data by airline. This should be very familiar; it's what we ran earlier. The select statement has some interesting stuff going on, though. We want to select the airline column, because we want to know which airline had these delayed flights, and we want to count the total number of flights that each airline had taking off from LaGuardia, so we simply count the number of entries in the departure delay column.
Notice that there is no condition on departure delay in the where clause: we want to count all the records and find the total number of flights for every airline. To find the number of flights that were delayed within the same query, we'll do something interesting: we'll use a conditional within an aggregation, which is something BigQuery allows. Here, within the sum aggregation, we have an if statement. This if statement checks whether a particular flight was delayed or not. If the departure delay is greater than zero, it will count that record as one; a value of one indicates that this record contributes to the sum aggregation.
So if f.departure_delay is greater than zero, consider this record to have a one associated with it; the else branch counts the record as zero. If the departure delay is not greater than zero, a zero is associated with this record. You can now imagine that we have a virtual column comprising zeros and ones: one if the flight was delayed, and zero if it was not. Apply a sum operation to this virtual column and you'll get the number of flights that were delayed. And that's it.
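Putting that together, a sketch of the full query, assuming as before that LaGuardia's code is "LGA" and that the date column holds strings like "2008-05-13":

    bq query --use_legacy_sql=false '
    SELECT
      f.airline,
      COUNT(f.departure_delay) AS total_flights,            -- every flight for the airline
      SUM(IF(f.departure_delay > 0, 1, 0)) AS num_delayed   -- 1 if delayed, 0 otherwise
    FROM `bigquery-samples.airline_ontime_data.flights` AS f
    WHERE f.departure_airport = "LGA"
      AND f.date = "2008-05-13"
    GROUP BY f.airline'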
We've taken a very simple query and applied a conditional statement within one of its aggregations to find the information that we are looking for. Run this query to see what the results look like: you see the airline, the total number of flights, and the number of those flights that were delayed on 13 May 2008. We've completed one part of this lab, but it should have shown you that when you're using standard SQL in BigQuery, make sure you enclose your table name in backticks. These backticks also allow you to specify wildcard expressions to query multiple tables in one go.