FIT5145: Foundations of Data Science Assignment 4: Shell Commands, Data Collection, Exploratory Data Analysis and Predictive Data Analysis

Assessment Details:

Faculty of Information Technology Semester 2, 2024

FIT5145: Foundations of Data Science Assignment 4: Description
Friday, Week 13 (October 25, 2024) 11:55 PM

● Assessment Type: Individual Assignment

Hand in Requirements:
In this assignment, three files (PDF report, RMD file, and csv file) should be submitted.

1. A report in PDF containing your (a) code, (b) answer, and (c) explanation used to answer each question. Please make sure that your answers to all the questions are numbered correspondingly.

(a) code: Make sure to include all the shell commands for Task A and the R code for Tasks B-D in the PDF report. For the shell commands, please copy your code and paste it into Word or other word processing software (please do NOT take screenshots of your code).

For the R code, please directly convert the RMD file containing your code into a PDF file (Note: please knit the RMD to HTML and print the HTML as PDF). If you want to use Microsoft Word or other word processing software to format your submission, please copy your code from the RMD file and paste it into Word (please do NOT take screenshots of your code).

You need to merge the shell PDF and R PDF files into a single PDF file.

(b) answer: Please make sure to include screenshots/images of the code outputs and written answers (not screenshots) for each question of Tasks A-D in the PDF.

(c) explanation: Please explain how you answered each question (i.e., explaining your code or summarising your work for each question).

Marks will be assigned to reports based on their correctness and clarity. For instance, higher marks will be given to reports containing graphs with appropriately labelled axes.
2. The RMarkdown file: Please submit the RMarkdown file that contains your R code for Tasks B-D of this assignment. Your file should contain all the code, proper comments, and any instructions for libraries that need to be installed.

Notes:

● Whenever a question asks for a certain value, your code should produce the value. For example, when a question asks for the number of rows contained in a table, your code should print out the answer. Extracting the answer manually will not earn any marks.

● The assignment should be submitted as three files (PDF report, RMD file, and csv file). An RMD file that generates errors when run will not be considered.

● Please make sure that the text in your PDF can be selected and highlighted, so that the Turnitin score can be generated properly for your PDF file (we only need the Turnitin score for the PDF file, not the RMD file).

Task A: Shell commands

In this task, you are required to explore and wrangle the data in the file “covid19-cable-broadcast.csv”, which contains transcript paragraphs that were collected from various programs aired in 2020 on cable and broadcast news networks, e.g., World News Tonight on ABC and Anderson Cooper 360 Degrees on CNN. These transcript paragraphs have been manually annotated according to their relevance to COVID-19. The file contains different variables to describe each collected transcript paragraph, as described below.

Column Name	Description
ID	The ID of the paragraph
network	Cable and broadcast news networks such as ABC, CNN, FOX, etc.
program	Programs aired on cable and broadcast news networks, such as “worldnewstonight” (the program World News Tonight on ABC)
date	The date when the corresponding program and paragraph aired
paragraph	The textual content of the paragraph
category	The categorical label provided by humans to indicate a paragraph’s relevance to COVID-19, i.e., covid_direct, covid_indirect, and non_covid

Please note that you are only allowed to use shell commands as you would run in Linux shell, Mac terminal, or Cygwin, to tackle this task. Using other utilities or tools such as PowerShell is NOT allowed.

1. What is the date range of the collected paragraphs?

2. We want to preprocess the ID and date columns.
   a. Count lines with an id that is not a number of 6 digits long, i.e., id values that contain anything other than numbers OR are of a length more/less than 6.
   b. Remove the lines mentioned in Q2-a and remove time values in the date column. For example, the date column will contain “29/04/2020”, instead of having “29/04/2020 23:13”.
   c. Display the first 3 lines of the dataset that was filtered in Question 2-b. Store the filtered dataset in a file named “filtered_covid.csv” and use this file for the remaining questions in Task A.

3. When was the first and last mention of the term “Australia” in the column paragraph? Please note that the first and last mention of a term refers to the chronologically earliest and latest paragraph containing the term in the dataset and the term to be searched is case sensitive.

4. Let’s investigate the program column.
   a. How many unique values are there in the program column?
   b. Can you write commands to list the top 5 most frequent program values in the dataset (i.e., the top 5 programs with the largest number of paragraphs)?

5. Let’s investigate the paragraph column.
   a. How many paragraphs contain both “ventilator” and “hospital”? (Note: Please ignore cases and consider variations.)
   b. How many paragraphs mention unemployment statistics? (Note: Please ignore cases and consider variations.)

6. In the following, please generate the dataset filtered according to the following conditions:
   a. Keep these columns: network, program, date, paragraph, and category
   b. Keep the paragraphs satisfying the following conditions: (i) the date belongs to an odd month (e.g.: January, March, ..., November); (ii) the network is cbs; and (iii) the category is covid_direct.

   Then, print out the first and last date of the filtered dataset (Please include a header).
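
The general patterns behind Q2, Q4 and Q6 can be sketched on a small invented file. Everything in this sketch is an assumption made for illustration: the file names sample.csv and filtered_sample.csv, the rows and program names, and the premise that the first four columns never contain embedded commas (the paragraph column may, which is why the category is read as the last field). It illustrates the techniques only; the real file may need extra care, for example if quoted fields span several lines.

```shell
# Toy stand-in for the real file; all rows below are invented.
cat > sample.csv <<'EOF'
ID,network,program,date,paragraph,category
100001,abc,worldnewstonight,29/04/2020 23:13,"Hospitals, again, are full.",covid_direct
10002,cnn,ac360,30/04/2020 01:00,Short id,non_covid
10000a,cbs,eveningnews,01/05/2020 18:30,Bad id,covid_indirect
100004,cbs,eveningnews,02/05/2020 18:30,Valid row,covid_direct
100005,cbs,eveningnews,03/06/2020 18:31,Another valid row,covid_direct
EOF

# Q2-a pattern: count data lines whose first field is NOT exactly six digits.
bad_ids=$(tail -n +2 sample.csv | grep -Evc '^[0-9]{6},')
echo "invalid ids: $bad_ids"

# Q2-b pattern: keep the header and the valid rows, and strip the time
# from field 4. Splitting and re-joining on commas leaves any commas
# inside the paragraph text where they were.
awk -F',' 'BEGIN { OFS = "," }
     NR == 1 { print; next }
     $1 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9]$/ { sub(/ .*/, "", $4); print }' \
    sample.csv > filtered_sample.csv

# Q2-c pattern: show the first three lines of the filtered file.
head -n 3 filtered_sample.csv

# Q4-b pattern: frequency table of the program column, largest count first.
top_program=$(tail -n +2 filtered_sample.csv | cut -d',' -f3 \
    | sort | uniq -c | sort -rn | head -n 1 | awk '{ print $2 }')
echo "most frequent program: $top_program"

# Q6 pattern: odd month, network cbs, category covid_direct.
# The month is the second part of dd/mm/yyyy; category is the last field.
odd_cbs=$(awk -F',' 'NR > 1 { split($4, d, "/") }
     NR > 1 && (d[2] % 2 == 1) && $2 == "cbs" && $NF == "covid_direct"' \
    filtered_sample.csv | wc -l | tr -d ' ')
echo "odd-month cbs covid_direct rows: $odd_cbs"
```

The six-digit check is spelled out as six bracket expressions in awk because some awk implementations do not support the `{6}` interval syntax.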

Task B: Data Collection and Exploratory Data Analysis Using R

There are many ways to collect data from different sources. One of them is web scraping. In this task, you are required to scrape data from websites, wrangle the scraped data if required, and visualise it.

Task B1:
Please extract the table “Historical rankings” from the Wikipedia page ICC Men's T20I Team Rankings (https://en.wikipedia.org/wiki/ICC_Men%27s_T20I_Team_Rankings). Note: please extract the entire table.

Then, please print out the following summarised table. Note: You might need to pre-process the data that you extracted from the web.

Country	Earliest_start	Latest_end	Average_duration
Pakistan	2017-11-01	2020-04-30	294.67
Sri Lanka	2012-09-29	2016-02-11	212.80
India	2014-03-28	2024-10-03	197.00
New Zealand	2016-05-04	2018-01-27	191.33
England	2011-10-24	2022-02-20	187.00
Australia	2020-05-01	2020-11-30	106.00
South Africa	2012-08-08	2012-09-28	21.00
West Indies	2016-01-10	2016-01-30	21.00

The summarised table shows the earliest start, latest end, and average duration for each country. (Note: Please replace the 'Present' value in India with the current date you are working with.)

Task B2: Please choose a different website that you are interested in and follow the instructions below:

1. Scrape data contained in table format in a website.
2. Wrangle data if required.
3. Create a plot for the scraped data.
4. Discuss the information or insights that can be drawn from the chart.

Task C: Exploratory Data Analysis using R

For Task C, you are required to visualise the relationship between the births, deaths, total fertility rate (TFR), net overseas migration (NOM) and net interstate migration (NIM) for the different Australian states/territories, and gain insights on how these relations and trends change over time. The data files used in this task were originally downloaded from the Australian Bureau of Statistics. We have extracted the data from the original files and transformed them into a simpler format. Please download the data from Moodle:

● Births.csv - This file contains yearly data regarding the recorded number of births by Australian state/territory of registration between 1977 and 2016.

● Deaths.csv - This file contains yearly data regarding the recorded number of deaths by Australian state/territory of registration between 1977 and 2016.

● TFR.csv - This file contains yearly data on the recorded average number of births per woman over her lifetime by each state/territory between 1971 and 2016.

● NOM.csv - This data file (Net Overseas Migration) contains yearly data on the net gain or loss of population through immigration (migrant arrivals) to Australia and emigration (migrant departures) from Australia, for the period between 1977 and 2016.

● NIM.csv - This data file (Net Interstate Migration) contains yearly data on the net gain or loss of population through the movement of people from one state or territory to another, for the period between 1977 and 2016.

C1. Investigating the Births, Deaths and TFR Data

1. Draw the number of births and deaths recorded in each state or territory over different years, and describe the plot.

2. Next, plot the natural growth in Australia's population over different years. Describe the plot.

3. Inspect the data on Total Fertility Rate (TFR.csv) for Queensland and Northern Territory.
   a. What was the minimum value for TFR recorded in the dataset for Queensland and when did that occur?
   b. What was the corresponding TFR value for Northern Territory in the same year?

4. Identify two additional variables from external sources that might affect the Total Fertility Rate (TFR) in New South Wales (NSW) and analyse the relationship between these variables and the TFR. Marks will be awarded based on the depth of your investigation (including the analyses and discussions conducted). (Note: You need to provide the data source links.)

C2. Investigating the Migration Data (NOM and NIM)

1. Let’s look at the Net Overseas Migration (NOM) data in different states over time.
   a. Use R to plot the NOM to Victoria, Tasmania and Western Australia over time. Explain and compare the trend in all three states (VIC, TAS and WA). What do you observe?
   b. Plot the NOM to Australia over time.

2. Identify two additional variables from external sources that might affect the NOM in Australia and analyse the relationship between these variables and the NOM. Marks will be awarded based on the depth of your investigation (including the analyses and discussions conducted). (Note: You need to provide the data source links.)

3. Create a table to display the states with the highest and lowest total migration shifts for each year.

4. Create a plot to show the proportion of total migration by each state for each year and provide a discussion of the plot.

Task D: Predictive Data Analysis using R

Do you think the FLoRA chatbot powered by GPT-4o is useful for solving Assignment 1? In this task, you will be asked to analyse the conversational data generated by students in this unit when interacting with the chatbot and perform predictive data analysis to characterise the usefulness of a dialogue. Please download the conversational data files from Moodle. All data has been anonymised.

You are required to build machine learning models to predict the usefulness of a dialogue represented in numerical scores. In total, we collected and pre-processed a total of 242 dialogues, and you can access 70% of these dialogues (shared in the data files “dialogue_utterance_train.csv” and “dialogue_usefulness_train.csv”), which are randomly selected as the training set that you can use to build machine learning models. Among the remaining 30% dialogues, 15% of them are randomly selected as the validation set and can be accessed via the files “dialogue_utterance_validation.csv” and “dialogue_usefulness_validation.csv”. The other 15% are used as the test set and can be accessed via the files “dialogue_utterance_test.csv” and “dialogue_usefulness_test.csv”. Please refer to Table 1 and Table 2 to know the meaning of each feature/column.

Table 1: Description of columns in the data files “dialogue_utterance_train/validation/test.csv”

Column Name	Description
Dialogue_ID	The unique ID of a dialogue
Timestamp	When an utterance contained in the dialogue was made
Interlocutor	Whether the utterance was made by the student or the chatbot
Utterance_text	The text of the utterance

Table 2: Description of columns in the data file “dialogue_usefulness_train/validation/test.csv”

Column Name	Description
Dialogue_ID	The unique ID of a dialogue
Usefulness_score	This score is given by a student to indicate their perceived usefulness of the FLoRA chatbot when answering the post-task questionnaire Question 3 (i.e., “To what extent do you think the GPT-powered chatbot on FLoRA is useful for you to accomplish the assignment?”). The value range of this feature is [1,5], with 1 representing “very unuseful”, 2 representing “unuseful”, 3 representing “neutral”, 4 representing “useful”, and 5 representing “very useful”.

If the dialogue you generated is included as part of the training set, you need to first exclude it before answering the following questions. The Dialogue_ID of your dialogue will be shared with you via email.

1. What features can you engineer to empower the training of a machine learning model? You may propose as many as you believe are useful. Please note that the number of features should not exceed the number of dialogues contained in the training set; otherwise, the constructed machine learning models are prone to overfitting. Select two features that you propose and use boxplots to visualise the feature values for the following two groups of dialogues in the training set: (i) those with a Usefulness_score of 1 or 2; and (ii) those with a Usefulness_score of 4 or 5. Is there any difference between the two groups of dialogues? How can you tell whether the difference is statistically significant? Higher marks will be given to the identification of features that display statistically significant differences.

2. Build a machine learning model (e.g., polynomial regressions, regression tree) by taking all the features that you have proposed and evaluate the performance of the model on the validation set using the relevant evaluation metrics you learned in class. The best-performing model here is denoted as Model 1.

3. Now we want to improve the performance of Model 1 (i.e., to get a more accurate model). For example, you may try some of the following methods to improve a model:
   ● Select a subset of the features (especially the important ones in your opinions) as input to empower a machine learning model or a subset of the data in a dialogue (given that some questions asked by students might not be directly relevant to solving the assignment).
   ● Deal with errors (e.g.: filtering out data outliers).
   ● Rescale data (i.e., bringing different variables with different scales to a common scale).
   ● Transform data (i.e., transforming the distribution of variables).
   ● Try other machine learning algorithms that you know.

   Please build the predictive models by trying some of the above methods or some other methods you can think of and evaluate the performance of the models and report whether Model 1 can be improved.

   You need to explain how you have improved your model by including code, output, and explanations (explaining the code or the process) and justify why you have chosen some of the above methods or some other methods to improve a model (e.g., why this subset of the variables are chosen to build a model). Marks will be given, based on the depth of investigation required to improve a model, as well as the sufficient justification provided for the proposed approaches. Higher marks will be given to answers which successfully demonstrate model performance improvement.

4. What is the Dialogue_ID of the dialogue you generated? Please copy and paste the whole dialogue text that you generated with the chatbot here. With the best-performing model constructed from Question 2&3, what is the prediction value for the dialogue you generated? Is the prediction value close to the groundtruth value? If yes, what features do you think play important roles here to enable the model to successfully make the prediction? How can you determine the importance of features quantitatively? If not, what might be the reasons? For students whose dialogues are included in the test set, you may randomly select a dialogue from the validation set to analyse and answer this question.

5. Please notice that the groundtruth Usefulness_score values in the file “dialogue_usefulness_test.csv” are withheld for now, but they will be shared after the due date of this assignment. Here, your task is to use the best-performing model constructed from Question 2&3 to predict the usefulness of the dialogues contained in the test set. You need to populate your prediction results (i.e., the predicted Usefulness_score values) into the file “dialogue_usefulness_test.csv” and upload it to Moodle to measure the overall performance of your model. Please ensure the number of columns and rows remains the same as in the original file (dialogue_usefulness_test.csv), and only fill in the prediction results in the 'Usefulness_score' column. Please name the submission file using the following format:

   LastName_StudentNumber_dialogue_usefulness_test.csv.

   The mark you receive for this question will be dependent on the performance level of your model (measured by RMSE).
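
For reference, root mean squared error (RMSE) is conventionally defined as below. The symbols are our own notation rather than the assignment's: n is the number of dialogues in the test set, y_i is the groundtruth Usefulness_score of dialogue i, and \hat{y}_i is the model's prediction for that dialogue.

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

A lower value means the predictions sit closer to the reported scores; a model that predicted every score exactly would have an RMSE of 0.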