blog #1: learning spreadsheet software
The first data analytics tool that I learned to use was Google sheets, it’s a spreadsheet software that’s flexible and versatile. I started to gravitate towards the pivot table feature to quickly summarize large amounts of data, analyze numerical data in detail, and answer unanticipated questions about my data.
Here's an example to demonstrate how pivot tables work.
- Consider this dataset. My main goal is to summarize the data and answer questions such as: how many apartments are there in the dataset? what’s the total cost of all the apartments? (After opening it click somewhere inside the data)
2. Click the menu Insert > Pivot table
This will create a new tab in your sheet called "Pivot Table 1" (or 2,3,4, etc as you create more) with the Pivot Table framework in place.
3. Click Rows in the Pivot table editor and add Property Type
5. Click Values in the Pivot table editor and add Property Type.
6. Click Values in the Pivot table editor and add Sales price.
That's it! The pivot table is now created and summarizes the data for each property type. It counts how many of each property type is found in the dataset and then calculates the sales price to give a total sales price value for each property type category. For example, the seven rows of data for Apartments are combined together into a single line in the Pivot Table
blog #2: data visualization with ggplot2
The ggplot2 is an R package dedicated to data visualization. It provides programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. I like using bar graphs to visualize data of different categories and scatter plots to visualize how two variables are correlated.
Here's an example of creating a bar graph on R.
- Consider the survey below. It was conducted from a group of 190 individuals, who were asked “What’s your favorite fruit?”
2. Put the data into a data frame.
> survey <- data.frame(fruit=c("Apple", "Banana", "Grapes", "Kiwi", "Orange", "Pears"),
+ people=c(40, 50, 30, 15, 35, 20))
> survey
fruit people
1 Apple 40
2 Banana 50
3 Grapes 30
4 Kiwi 15
5 Orange 35
6 Pears 20
3. To create a bar graph, use ggplot() with geom_bar(stat="identity") and specify what variables to put on the X and Y axes.
> # Create a basic bar graph with ggplot
> library(ggplot2)
> ggplot(survey, aes(x=fruit, y=people)) +
geom_bar(stat="identity")
4. The bar graph is now complete. By default, bar graphs use a dark grey color filling.
Based on the previous example I will demonstrate how to use the fill argument to apply different colors.
> # Change the colors of individual bars (default fill colors)
> ggplot(survey, aes(x=fruit, y=people, fill=fruit)) +
+ geom_bar(stat="identity")
The ggplot2 package provides some premade themes to change the overall plot appearance. With themes, the user can easily customize some commonly used properties, like background color, panel background color, and grid lines.
> # Change the ggplot theme to 'Minimal'
> ggplot(survey, aes(x=fruit, y=people, fill=fruit)) +
+ geom_bar(stat="identity") +
+ theme_minimal()
Here's an example of creating a scatter plot on R.
- Consider the iris flower data set below.
> # First six observations of the 'Iris' data set
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
2. To create a scatter plot, use ggplot() with geom_point() and specify what variables to put on the X and Y axes.
> # Create a basic scatter plot with ggplot
> ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) +
+ geom_point()
3. The scatter plot is now complete.
blog #3: data extraction with SQL
ORDER BY keyword:
- Consider the Orders database table below. I want to see the top orders and their details.
2. I can do this by listing the orders in descending order (10–1, Z–A) by value.
SELECT *
FROM Orders
ORDER BY Total_Value DESC;
3. The output is:
Using a ‘*’ after SELECT tells the SQL engine to retrieve all the columns for a table. The ORDER BY clause directs that the results be shown in a particular order. By default, the results are returned in ascending order (1–10, A–Z); however, since I explicitly mention the DESC keyword, my results are returned in descending order.
CASE command:
Based on the previous example, a business will have thousands of rows in the Orders table. The user may not always want to see the total value of each order and instead would prefer to categorize them based on value.
- Therefore the CASE function is used to categorize orders into ‘High’ or ‘Low’ value depending on whether the order value is above or below $150.
SELECT Order_Id, Total_Value
CASE WHEN Total_value > 150 THEN 'HIGH'
WHEN Total_value < 150 THEN 'LOW'
ELSE 'MEDIUM' END as Order_Category
FROM Orders;
2. The output is:
In this query, each row is evaluated by the CASE function. Rows are assigned a category based on the first condition (WHEN .. THEN ..) that evaluates to true. Since the orders with the IDs of 13, 11 and 22 are above $150, they are categorized as ‘HIGH’. The order with ID 23 is assigned ‘LOW’. The ELSE specifies a default return value if none of the conditions evaluate to true for any row.
GROUP BY statement:
To generate insights from large data sets, the user may want to group similar items together to make informed decisions. Grouping the data helps find patterns, behaviors, or metrics. This grouping is called aggregation; it can be achieved with the GROUP BY clause. Commonly used aggregate functions are SUM(), COUNT(), AVG(), MAX(), and MIN().
- Consider the following example: A business owner wants to send a gift card to the highest-spending customer.
- Use the GROUP BY clause to retrieve the top customer in terms of total order value.
SELECT Customer_Id, SUM(Total_Value)
FROM Orders
GROUP BY Customer_Id
ORDER BY 2 DESC;
3. The top customer is Customer_Id 1213.
The query works by first selecting all the unique Customer_Id values and then using GROUP BY to find the aggregate total sum for each (the total value of all their orders). The ORDER BY 2 DESC clause displays the results in decreasing order by SUM(Total_Value).
JOIN clause:
In the previous example, I was able to find the Customer_Id of the highest-spending customer, but the Orders table doesn’t store any other customer information, such as name or communication details.
- The JOIN clause combines records from two or more tables in a database in a single query.
- For this example I will use Customer_Id as the common field and will join the Customers and Orders tables.
SELECT b.Customer_Name, b.Email, a.Customer_Id, SUM(Total_Value)
FROM Orders a
JOIN Customers b
ON a.Customer_id = b.Customer_Id
GROUP BY b.Customer_Name, b.Email, a.Customer_Id
ORDER BY 4 DESC;