blog #1: learning spreadsheet software 

The first data analytics tool that I learned to use was Google sheets, it’s a spreadsheet software that’s flexible and versatile. I started to gravitate towards the pivot table feature to quickly summarize large amounts of data, analyze numerical data in detail, and answer unanticipated questions about my data. 

Here's an example to demonstrate how pivot tables work. 


2. Click the menu Insert > Pivot table

This will create a new tab in your sheet called "Pivot Table 1" (or 2,3,4, etc as you create more) with the Pivot Table framework in place. 

3. Click Rows in the Pivot table editor and add Property Type

5. Click Values in the Pivot table editor and add Property Type

6. Click Values in the Pivot table editor and add Sales price. 

That's it! The pivot table is now created and summarizes the data for each property type. It counts how many of each property type is found in the dataset and then calculates the sales price to give a total sales price value for each property type category. For example, the seven rows of data for Apartments are combined together into a single line in the Pivot Table

blog #2: data visualization with ggplot2 

The ggplot2 is an R package dedicated to data visualization. It provides programmatic interface for specifying what variables to plot, how they are displayed, and general visual properties. I like using bar graphs to visualize data of different categories and scatter plots to visualize how two variables are correlated. 

Here's an example of creating a bar graph on R. 


2. Put the data into a data frame.

> survey <- data.frame(fruit=c("Apple", "Banana", "Grapes", "Kiwi", "Orange", "Pears"),

+                      people=c(40, 50, 30, 15, 35, 20))

> survey

   fruit people

1  Apple     40

2 Banana     50

3 Grapes     30

4   Kiwi     15

5 Orange     35

6  Pears     20

3. To create a bar graph, use ggplot() with geom_bar(stat="identity") and specify what variables to put on the X and Y axes. 

> # Create a basic bar graph with ggplot

> library(ggplot2)

> ggplot(survey, aes(x=fruit, y=people)) +

         geom_bar(stat="identity")

4. The bar graph is now complete. By default, bar graphs use a dark grey color filling. 

Based on the previous example I will demonstrate how to use the fill argument to apply different colors. 

> # Change the colors of individual bars (default fill colors)

> ggplot(survey, aes(x=fruit, y=people, fill=fruit)) + 

+       geom_bar(stat="identity")

The ggplot2 package provides some premade themes to change the overall plot appearance. With themes, the user can easily customize some commonly used properties, like background color, panel background color, and grid lines.

> # Change the ggplot theme to 'Minimal'

> ggplot(survey, aes(x=fruit, y=people, fill=fruit)) + 

+        geom_bar(stat="identity") +

+        theme_minimal()

Here's an example of creating a scatter plot on R. 


> # First six observations of the 'Iris' data set

> head(iris)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1          5.1         3.5          1.4         0.2  setosa

2          4.9         3.0          1.4         0.2  setosa

3          4.7         3.2          1.3         0.2  setosa

4          4.6         3.1          1.5         0.2  setosa

5          5.0         3.6          1.4         0.2  setosa

6          5.4         3.9          1.7         0.4  setosa

2. To create a scatter plot, use ggplot() with geom_point() and specify what variables to put on the X and Y axes.

> # Create a basic scatter plot with ggplot

> ggplot(iris, aes(x=Petal.Length, y=Petal.Width)) +

+  geom_point()

3. The scatter plot is now complete. 

blog #3: data extraction with SQL

ORDER BY keyword: 


2. I can do this by listing the orders in descending order (10–1, Z–A) by value. 

SELECT *

FROM Orders

ORDER BY Total_Value DESC;

3. The output is: 

Using a ‘*’ after SELECT tells the SQL engine to retrieve all the columns for a table. The ORDER BY clause directs that the results be shown in a particular order. By default, the results are returned in ascending order (1–10, A–Z); however, since I explicitly mention the DESC keyword, my results are returned in descending order.

CASE command: 

Based on the previous example, a business will have thousands of rows in the Orders table. The user may not always want to see the total value of each order and instead would prefer to categorize them based on value. 


SELECT Order_Id, Total_Value

       CASE WHEN Total_value > 150 THEN 'HIGH'

            WHEN Total_value < 150 THEN 'LOW'

            ELSE 'MEDIUM' END as Order_Category

FROM Orders;

2. The output is: 

In this query, each row is evaluated by the CASE function. Rows are assigned a category based on the first condition (WHEN .. THEN ..) that evaluates to true. Since the orders with the IDs of 13, 11 and 22 are above $150, they are categorized as ‘HIGH’. The order with ID 23 is assigned ‘LOW’. The ELSE specifies a default return value if none of the conditions evaluate to true for any row.

GROUP BY statement: 

To generate insights from large data sets, the user may want to group similar items together to make informed decisions. Grouping the data helps find patterns, behaviors, or metrics. This grouping is called aggregation; it can be achieved with the GROUP BY clause. Commonly used aggregate functions are SUM(), COUNT(), AVG(), MAX(), and MIN()


SELECT Customer_Id, SUM(Total_Value)

FROM Orders

GROUP BY Customer_Id

ORDER BY 2 DESC;

3. The top customer is Customer_Id 1213. 

The query works by first selecting all the unique Customer_Id values and then using GROUP BY to find the aggregate total sum for each (the total value of all their orders). The ORDER BY 2 DESC clause displays the results in decreasing order by SUM(Total_Value).

JOIN clause: 

In the previous example, I was able to find the Customer_Id of the highest-spending customer, but the Orders table doesn’t store any other customer information, such as name or communication details. 



SELECT b.Customer_Name, b.Email, a.Customer_Id, SUM(Total_Value)

FROM Orders a

JOIN Customers b

ON a.Customer_id = b.Customer_Id

GROUP BY b.Customer_Name, b.Email, a.Customer_Id

ORDER BY 4 DESC;

3. The output is: 

The query returns customer details I wanted. It works by comparing the Customer_Id fields in both tables and then retrieving only those rows where it finds a match for the values. Notice that the customer IDs 1211, 1214 and 1215 are not in the results; there were no matching rows in the Orders table for these values.