Analysis of Airlines using SQL
Wanted to flex my SQL skills here and answer these questions using an fights dataset I found one of the corners of the internet. Here are the questions I tired to answer:
How many records are in each table in the database?
Did any flights depart and arrive exactly on time?
What is the mean arrival delay among flights?
How many flights had above average arrival delays?
What is the relationship between temperature and wind speed? Do changes in the mean temperature throughout the year (by month) move reflect changes in mean wind speed?
Print the first 5 rows of flights and airlines. Does the flights table contain the full name of the airline? If so, print them. If not, write a query to return the first 10 records in the flights data with the following columns: the origin, destination, carrier, flight number, and full name of the airline for each flight in a single table.
Your supervisor is interested in the airlines that offer the most flights in peak periods. Write a query that returns the number of flights each airline operates in December. The results should be a single table that contains five columns: carrier, flight, month, name, and num_flights. The results should be limited to 10 and shown in descending order by number of flights.
Use pandas to connect to the database. Define a function that takes one argument--month_num--called num_flights_fun that uses pandas' `.read_sql_query()` and runs a query that returns the number of flights each airline operates in whatever month is specified in the function call. The result should be returned as a pandas dataframe that contains five columns: carrier, month, name, and num_flights.
Test your function to ensure it works. Create a new dataframe called sept that contains the results with only records from September and a dataframe called jan that contains the results with only records from January. The only change you should make to get the results is adjusting the value of the input parameter you are passing in the function call in num_flights_fun. Print out the last five rows of sept and jan.
Using sept and jan, print the top five airlines with the greatest number of flights in September and in January, in descending order. Are the top five airlines the same in September and January? If so, what are the airlines? If not, which airlines differ from September to January?
Create a new function using pandas called num_flights_fun_all that runs the same query as above but returns the records for all months. The function should not take any arguments.
Use num_flights_fun_all to create a dataframe called months_df. The dataframe should have carrier, month, name, and num_flights columns and contain the number of flights that each airline made in each month. Show the first 15 lines of the dataframe to ensure the results are correct.
Use months_df to plot the number of flights each airline made for each month using altair, plotly, or seaborn.
Write a query that returns the each airlines' mean arrival delay across all of its flights. The result should have two columns: mean_delay and the full name of the airline. Which airline has the highest average delay?
Use pandas to query the data and return a dataframe to plot the data. Use either altair, seaborn, or plotly and plot the mean delay for each airline in a bar plot. The plot should have the airline name on the y and mean delay on the x and be ordered by mean delay. The bars should be colored by whether the airlines' flights arrive, on average, early or late. Which airline, on average, has the earliest arrivals?
Want to know what answers I got, click the GitHub repository button below!