Assignment 6: Data Visualization (visualization scenarios, socially responsible computing, parting words)
Due: 11:59 PM Eastern Time, July 30th, 2021
Getting the stencil
You can click this link to get the stencil for this assignment.
Important : Please view Appendix A for the stencil code structure.
Python Data Visualization Modules
In this assignment, you may want to use packages that have not been installed on our course virtual environment. Please refer to Appendix B for instructions on how to install new modules in your virtual environment. You will not have write access to install packages in the course virtual environment on the department filesystem, so please let the HTA know if there is a Python module that is not yet available in the course environment and that you think should be added to the official course virtual environment.
Some modules that we recommend using in this assignment are: Pandas , Matplotlib , Seaborn , Plotly , and Sklearn (for your Machine Learning models, and for decision tree visualization ).
Before you start…
We recommend finishing the lab before working on the assignment.
We also care about accessible data visualization. Before you start designing your dashboard, we want you to read the following articles about accessible Data Visualization:
- “ 5 tips on designing colorblind-friendly visualizations ” (also covered in lab)
- “ Why Accessibility Is At The Heart of Data Visualization ” - particularly, pay attention to the Design equivalent experiences for all your readers section.
- “ A Comprehensive Guide to Accessible Data Visualization ” - the article should provide you with specific suggestions to make data visualization accessible to people with visual impairments.
Additionally, we hope that you will utilize the following tools:
- Colorblindness Simulator , where you can upload a photo (e.g., of a graph that you made), and it will show how the graph would look to someone with a certain type of colorblindness.
- Guide: Including Alt Text in Markdown files - this will give you some guidance on how to include alt text in your writeup.md report.
Keep the principles from the readings in mind as you design and implement your dashboard. You should try your best to utilize these best practices in your graphs for this assignment, and note the times during your design and implementation process where you could and could not act on suggestions in the readings. You will answer questions about your observations after having produced all the visualizations.
We hope that this will be a fun assignment and will closely resemble future data science work!
Gradescope Autograder & Collaboration
Due to the free-form nature of the assignment, we do not have an autograder configured on Gradescope. Feel free to talk to your friends or come to TA hours to get feedback on your graphs (e.g., “ does it make sense that I use graph X to communicate this information? ” or “ how do you feel about my design for graph Y ”). However, you should be the one who determines the design choices and comes up with the code to produce the graphs.
We have been and will continue to use Gradescope on our assignments to check for code similarity between submissions.
In this assignment, there is no one way to do things: You are the person to make the design choices to visually analyze your data/your models. The design choices that you will make in this assignment include:
- What are the questions that I will graphically analyze?
- What kinds of plots will I produce to analyze my questions? Why?
- Out of the many Python visualization tools/packages, what will I use to produce the plots? How can I use it to make my graphs the most informative and accessible?
In response to the scenarios posed in this assignment, all the code that you write to produce the answer plots should go in their respective files (as noted in each section). In writeup.md , you will have to include the produced plots and write your answer to each question that you decide to analyze. In particular, be sure to mention:
- Question : What is the question/aspect you want to analyze?
- Graph & Analysis : Include the graph(s) that you use to analyze the question/aspect. How should we interpret the graph(s)? How should we use this information to judge the model/the dataset, or to decide the next steps in our data analysis?
If you are ever in doubt about whether an aspect of analysis is “valid,” feel free to reach out to your TAs for help!
We will go through your code file to make sure that the code you wrote corresponds to the graphs that you produce, so please structure your code in the cleanest way possible so we can give you credit. We expect well-designed graphs. This means:
- Your graphs must have clear titles and axis labels, and should generally communicate information properly.
- Your graph needs to follow accessible graph design principles - e.g., think colors, sizes, or alt texts. Please refer to the lab and to the accessible design tools/articles above for more information regarding accessible visualization.
- Your graph communicates information well on its own, but you also do a good job with analyzing your graphs (refer to the questions mentioned above).
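As a small illustration of these basics, here is a hedged matplotlib sketch (the category names and counts below are made up for illustration; 'tableau-colorblind10' is one of matplotlib's built-in colorblind-friendly styles):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line to view interactively
import matplotlib.pyplot as plt

# A colorblind-friendly style that ships with matplotlib
plt.style.use("tableau-colorblind10")

fig, ax = plt.subplots()
outcomes = ["Citation", "Warning", "Arrest"]  # hypothetical categories
counts = [120, 45, 15]                        # hypothetical counts
ax.bar(outcomes, counts)
ax.set_title("Stop outcomes (hypothetical counts)")  # clear title...
ax.set_xlabel("Outcome")                             # ...and axis labels
ax.set_ylabel("Number of stops")
fig.savefig("outcomes.png")
```

The Colorblindness Simulator linked above is a good way to double-check the saved image before you include it in your writeup.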
You are free to use any kinds of plots, packages, etc. - as long as you include your graphs in the graphs folder (and in the writeup.md file).
We will use two datasets in this assignment: The RI Traffic Stops Dataset, and the Banknote Authentication Dataset. The datasets and their details (features, source, acknowledgements) can be found in the data/ folder.
The data is labeled. More specifically:
- RI Traffic Stops Dataset : By default, the name of the target feature is stop_outcome , and the names of the features are the rest of the attributes.
- Banknote Authentication Dataset : By default, the name of the target feature is Class , and the names of the features are the rest of the attributes.
Stage 1: Data Visualization in Data Exploration
Your code in this section goes into stage_one.py . You are expected to explore three aspects of your choice in your data (with at least one accompanying graph for each aspect). The aspects may come from either of the provided datasets: you only need to produce graphs for three aspects in total across the two datasets, not three per dataset (i.e., not six).
Some questions you might want to think about when exploring a dataset:
You want to build a Machine Learning model on the datasets, but as a stellar data scientist, you realize that you need to explore what the data distribution looks like first. What kinds of graphs will you produce to explore your data before you dive into building the model?
Hint : To build a good model, you may want to look at the distribution of the fields (of your interest) that exist in your dataset. For example:
- If a field consists of continuous numerical values: How are the values in this field distributed? What is the range, the median and standard deviation of the values in this field?
- If a field consists of categorical values: How many distinct categories can the values be divided into, if applicable?
- If your dataset has true target labels: Are the classes in your dataset balanced (meaning, roughly the same number of samples for each class)?
- Looking at multiple different fields, what does the breakdown of the data look like? For example: looking at fields A and B , how many samples have: (A,B) = (a_1, b_1) ? (a_1, b_2) ? ( a_2, b_1 )? And so on…
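As a concrete sketch, the checks above might look like this in pandas. The tiny DataFrame below is a stand-in for whatever get_ri_stops_df / get_banknote_df return; the column names and values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for a dataset loaded via get_ri_stops_df() / get_banknote_df()
df = pd.DataFrame({
    "variance": [1.2, -0.5, 3.1, 0.7, -1.8, 2.2],  # a continuous field
    "driver_race": ["White", "Black", "White", "Hispanic", "White", "Black"],
    "stop_outcome": ["Citation", "Arrest", "Citation", "Citation", "Warning", "Arrest"],
})

# Continuous field: range, median, and standard deviation
print(df["variance"].describe())

# Categorical field: distinct categories and their counts
print(df["driver_race"].value_counts())

# Class balance of the target
print(df["stop_outcome"].value_counts(normalize=True))

# Breakdown across two fields: how many samples have each (A, B) pair
print(pd.crosstab(df["driver_race"], df["stop_outcome"]))
```

From here, df["variance"].plot.hist() or a bar chart of the crosstab turns these summaries into graphs.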
A picture is worth a thousand words; a graph is worth a thousand numbers. Before we decide what to do next with the data - e.g., which machine learning model to use - it is important to visualize the dataset (and not just each feature’s statistics).
In a supervised context where the dataset that you’re provided contains the target labels, you can simply plot your data points from your original data and see if they are already in "natural clusters". ( Hint: Take a look at the functions plot_multiclass_fig_2D , plot_fig_3D , plot_linear_line , and plot_linear_plane in sample.py ! ) You may want to explore whether your data is linearly separable or already clustered into almost distinct clusters - if it is, then you can use very simple Machine Learning models - e.g., SVM with linear kernels, logistic regression, etc. - on your dataset; if not, you’ll have to use more complex ones - e.g., SVM with more complex kernels, or deep neural networks.
Hint : For a dataset with many different attributes, it might be hard for us to plot more than 3 dimensions at once. To handle this problem, you can reduce the dimensionality of your dataset to be either 2-dimensional or 3-dimensional (using methods such as Principal Component Analysis, or regression and picking the “most important” subsets of variables). You can visualize one graph of your most important features, or you can produce a few different graphs to visualize different subset of features to derive your conclusions about the data.
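A hedged sketch of that idea using scikit-learn's PCA, with synthetic data standing in for the real feature matrix:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line to view interactively
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 200 samples, 5 features, 2 classes
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Project down to 2 dimensions so the data can be scattered on a plane
X2 = PCA(n_components=2).fit_transform(X)

fig, ax = plt.subplots()
for label, marker in [(0, "o"), (1, "^")]:  # vary marker shape as well as color, for accessibility
    mask = y == label
    ax.scatter(X2[mask, 0], X2[mask, 1], marker=marker, label=f"class {label}")
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_title("Data projected onto first two principal components")
ax.legend()
fig.savefig("pca_scatter.png")
```

If the classes separate cleanly in this projection, that is evidence a simple model may suffice.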
Some examples of aspects that you can analyze :
- In the RI Traffic Stops Dataset, how many examples (rows) are there of each class? What does this suggest about the kinds of Machine Learning models we should use on this dataset?
- In the Banknote Authentication Dataset, are the data points linearly separable / almost linearly separable? What does this suggest about the kinds of Machine Learning models we should use on this dataset?
- In the RI Traffic Stops Dataset, what does the breakdown of the data look like when looking specifically at the features 'driver_race' and 'search_conducted' ? How about 'driver_race' and 'stop_outcome' , or 'driver_race' and 'is_arrested' ? What can we say about the relationship between our features of interest, if at all?
- How does the number of traffic stops change over the years in the Traffic Stops Dataset?
- What is the distribution of the continuous variables in the Banknote Authentication Dataset?
Stage 2: Data Visualization for Model Exploration
Your code in this section goes into stage_two.py . You are expected to explore three aspects of your Machine Learning models, of your choice - again, with at least one accompanying graph for each aspect.
In utils.py , we have built the code to construct four different Machine Learning models (decision tree, k-nearest neighbor, logistic regression, and dummy classifier) – examples of how to use our code to get the trained models can be found in sample.py . Feel free to build your own ML models, change the code that we have provided for you in utils.py , etc. – whatever helps you produce the graphs! Some questions you might want to think about:
- What are the true positive, false positive, true negative, and false negative rates of each model? What can we say about the performance of each model from these? You may find plotting a confusion matrix useful to support your answer.
- What is the change in performance as you tweak your models? That is, how does changing the k in the k-nearest neighbor algorithm impact the accuracy? What is the accuracy of a model that is trained using only 3 features, in comparison to those that use 4 or 5 features?
- How do your models do in comparison to a baseline/ dummy model (a model that predicts using super simple heuristics like random guessing or most-likely-outcome guessing)?
- What is the decision-making process that your model used to make its predictions? What were the splits that your decision tree made, or what coefficients did your model assign to each feature? What can we say about the significant features in these models?
- In our ML models, we use sklearn’s OneHotEncoder to encode our non-numerical, categorical features. You can read more about OneHotEncoder here if you are not familiar, and the TAs are here to help!
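To make these questions concrete, here is a small scikit-learn sketch on synthetic data. It is illustrative only - the stencil's get_trained_model and get_model_accuracy wrap similar estimators, but the wiring below is not the stencil's code:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary-classification data standing in for the real datasets
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Confusion matrix: rows = true class, columns = predicted class
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
cm = confusion_matrix(y_test, knn.predict(X_test))
print("confusion matrix:\n", cm)

# Sweep k to see how the hyperparameter affects test accuracy
accuracies = {
    k: KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_test, y_test)
    for k in [1, 3, 5, 9, 15]
}
print("accuracy by k:", accuracies)

# Baseline: always predict the most frequent class
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("baseline accuracy:", dummy.score(X_test, y_test))
```

A line chart of accuracies and a heatmap of cm (e.g., via sklearn's ConfusionMatrixDisplay) are natural graphs for this stage.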
Stage 3: Data Visualization for SICK Applications
Code in this section goes into stage_three.py .
In this section, you will produce a geographic map to visualize the traffic stop data per county in Rhode Island. We recommend that you use Plotly (and we have an example for you in sample.py ). You are expected to produce at least one geographic map of your choice!
Some examples of graphs that you can make:
- What is the total number of traffic stops from 2005 to 2015 in each county?
- What does the disparity in traffic stops look like in each county? Think about the difference between the number of traffic stops of drivers of one race versus another (divided by the total number of traffic stops in that county).
We recognize that the hover effects that interactive Plotly graphs provide are lost when you download and include a static image in the writeup. Therefore, you can just include a static graph in your writeup, and we will check your code to see your Plotly graph(s) on our machines later – no penalty at all!
Stage N: Extra Credit
Code in this section goes into stage_n.py .
You can analyze up to three additional aspects of your choice, and we will give you at most five extra credit points for each additional aspect that you analyze. You will receive 2.5 points for producing a good graph (clear, accessible, makes sense for your goal of analysis, and accompanied by clear analysis), and an extra 2.5 points if the graph is distinct from the other graphs that you have produced in this assignment, for a maximum of 5 points per graph.
For this section, we will give at most 15 points as extra credit towards the assignment.
Socially Responsible Computing
(3 points) Please list at least three examples of accessible practices in data visualization.
(10 points) Evaluate the accessibility of the graphs that you produced. Specifically:
- What kinds of users might find your graphs accessible?
- Who might have more difficulty accessing your graphs?
- What further actions might you take to make this dashboard more accessible to a wider audience?
Your response should refer to at least one of the readings.
(7 points) Reflect on the stages of your design and implementation process.
- What did you find easy about factoring accessible visualization practices into your graphs?
- Think about the steps that you could not take/that you found hard to make your graphs more accessible. What are some factors that kept you from taking these steps?
- What do you think are some things everyone should do to make accessible design the norm?
As per usual, please run python3 zip_assignment.py to zip the assignment and submit onto Gradescope. The script will include all the files in your directory (e.g., all the code .py files, all your graphs, and the writeup.md of your report), except for the files in the data folder.
After submitting, please make sure that your Gradescope submission includes your filled-out writeup.md report and all the code that you use to produce the graphs in your report.
Note : Manually zipping your files risks (1) not including some files that will be used as part of our grading, and (2) your code not upholding our anonymous grading policy. Please use the zip_assignment.py script to zip and submit, or directly submit through Github.
Congratulations on finishing your last homework assignment in the course! We hope that our course has been educational and fun. We truly believe that you’re set to become an amazing data scientist.
We’d love to hear any constructive feedback about our course.
Last but not least, we hope that you enjoy our parting gift .
This assignment was made by Nam (ndo3) in Summer 2021.
Appendix A: Stencil Code Structure
The structure of the stencil is as follows:
code/ : Folder that contains all the code. You can make as many helper .py files as you want here, and they will all be included in the submission.
- stage_*.py : the respective Python files in which you will write your code to produce the answers for the assignment.
- utils.py : helper functions for loading data and building/evaluating models, including:
  - get_ri_stops_df and get_banknote_df : functions to load the data (as a DataFrame)
  - get_trained_model : function to get trained ML models (Logistic Regression, Decision Tree, K-Nearest Neighbors, and a dummy baseline classifier)
  - get_model_accuracy : function to evaluate trained ML models
- sample.py : This contains examples for how you can draw certain kinds of plots using Matplotlib and Plotly , as well as for how you can use the functions in utils.py
data/ : Folder that contains all the data ( .csv files) and their README files (which contain information on what each attribute means and its data type).
graphs/ : Folder that should contain all the graphs that you will (1) include in writeup.md , and (2) submit to us.
writeup.md : You will include the graphs that you made in this assignment and your response to each stage in this file. Answers to the Socially Responsible Computing questions also go here.
Appendix B: How to install a new package in your own virtual environment
Note : The instructions below only work on your own virtual environment. If you are using the official course virtual environment on the department machine (e.g. at /course/cs1951a/venv ), you will not have write permission to install new packages in the virtual environment – please reach out to the HTA if the package that you’re interested in using for the assignment is not available and you think it should be added to the official course virtual environment.
Step by step instructions: If you run into any roadblock following these steps, feel free to come to TA hours for more support!
Suppose you are trying to import a package X to use in your Python program. In this example, we will use the Seaborn package as X - using the import statement import seaborn as sns . However, Python is not happy about that statement, and gives you the error message ModuleNotFoundError: No module named X (in this case, ModuleNotFoundError: No module named 'seaborn' ).
Step 1 : Activate your course virtual environment (e.g., using the cs1951a_venv command that we have set up in Homework 0, or using source PATH/TO/YOUR/VIRTUAL/ENVIRONMENT/bin/activate ). Try Googling how to activate your virtual environment ( this page might be helpful ) if you don’t know how to.
This step is to make sure that your module is installed to the virtual environment with which you will run your code for the assignment.
Step 2 : In your terminal, type in pip3 install <NAME-OF-MODULE> or python3 -m pip install <NAME-OF-MODULE> .
In the example above, the command would be pip3 install seaborn or python3 -m pip install seaborn .
After you have successfully installed the module, the last line/one of the last lines displayed in your terminal should say "Successfully installed <MODULE-NAME>-<MODULE-VERSION>" (in my case, that would be seaborn-0.11.1 )
Step 3 : Check whether the module is successfully installed by running your program (the one that contains the import line) again. If it is not (the ModuleNotFoundError shows up again), feel free to come to TA hours for help! Be sure that you are in the virtual environment both when installing the module and when running the code that contains the module import statements.
All the work students will complete for evaluation and credit in the course is described below.
Tasks for each lesson are described here. Each task is designed to demonstrate a particular skill or idea from the lesson or prepare for the next lesson.
Tasks will usually be evaluated on a 0-10 scale on the following rubric:
- 0-5: Work missing or mostly incomplete.
- 6-7: Mostly complete or complete with major deficiencies.
- 8: Complete and meets expectations.
- 9-10: Complete and excels in some respect: organization, clarity, creativity.
Task 1 (Lesson 1)
Find two data visualizations that you find informative, compelling, or in need of improvement.
Create a document that shows each visualization (the figure, or a snapshot of a dynamic visualization) and provides the source (e.g., url and publication details if applicable). In a few sentences describe:
- the data behind the visualization,
- main message conveyed by the visualization,
- one or two features of the visualization that make it effective or suggestions for improvement.
The goal of this project isn’t to be right or wrong, but rather to start the process of looking at data visualizations through the perspective of creator, designer, and critic. It’s okay if you find visualizations from secondary sources and not the creator or original publisher. Include the reference to the source you used to find the visualizations.
Submit your work as a single PDF on Brightspace.
Due Monday 11 January at 9:00 am Atlantic.
Bonus task (Lesson 2)
Read Healy, Sections 2.1-2.4 . There is nothing to submit for this task.
Task 2 (Lesson 3)
See the instructions in the syllabus or in Lesson 3 notes
- Install R, Rstudio, and the packages identified in Healy’s Preface .
- Install git.
- Create a github account.
- Login to rstudio.cloud . This is a complete R, Rstudio, and git setup “in the cloud” that can be used if you have trouble using R on your own computer.
- Ask for help with any of these tasks if you need it.
Answer the quiz on Brightspace which will ask if you were successful with each task or if you need help. Provide your github name in one of the quiz questions.
Due Monday 18 January at 9:00 am Atlantic.
Task 3 (Lesson 4)
Some people feel very strongly about the placement of 0 on the vertical scale of plots. Look again at the carbon dioxide plots in Lesson 1 . The vertical scale does not start at 0. Use the ideas in Healy Chapter 1.6 to describe how you would interpret vertical position on the carbon dioxide plots and how you could interpret this position if 0 was included on the vertical scale.
Hans Rosling’s visualizations (as shown in Lesson 1 ) use many channels for conveying data: x and y position, color, size, an annotation for year in the plot background. The interactive versions use an animation for change over time, and mouse-over pop-ups to identify the country for each dot. These are very complex visualizations!
- For Rosling’s plot shown in Lesson 1, what variables are shown for each of x and y position, color, and symbol size?
- According to Healy Chapter 1 which of these 4 features is most difficult to make quantitative comparisons with? Why? Do you agree?
- In your judgment, is this visualization effective or too complex? Watch the TED talk or experiment with the interactive version before answering the question. Does the live explanation provided in Rosling’s oral presentation help you interpret the plot?
Write answers for these two questions in a word processor (we’ll start using R markdown soon) and submit as a single PDF on Brightspace. Assessment: 10 points total, 5 points per question.
Bonus task (Lesson 5)
Repeat the examples from Lesson 5 and/or the accompanying video in Rstudio until you are comfortable with the basics of making a plot. We will learn much more about plotting starting the lesson after next.
There is nothing to submit for this task.
Task 4 (Lesson 6)
Use Rstudio to create a new project from the github repository for Assignment 1. This is a new task, but it’s going to be a recurring task throughout the course. Here are step-by-step instructions. If you have trouble with this task, ask for help. It’s very important.
- Go to your github account and look for a repository called “assignment-1-” followed by your github user name. Get the link by clicking the green button labeled “Code”. It should look like this: https://github.com/Dalhousie-AndrewIrwin-Teaching/assignment-1-<your github name>.git
- Using Rstudio on your computer, select menu File > New Project > From repository… > Git. Insert the link to the repository and tell R where to put the repository on your computer.
- If you use rstudio.cloud, start at the projects page and choose New Project from git repository (pop-up menu) or open your existing project for our course. Clone the github repository in your project using the terminal window and type the command git clone <link from github> . Finally, open the new folder in the Files pane and click on the “assignment-1.Rproj” link to switch to the project. You may be asked by rstudio.cloud to link to your github account to access private repositories.
There is a video to help you with this task. You can also look at the slides .
There is nothing to submit for this task, but please complete it by Monday 18 January or ask questions at office hours on Tuesday 19 January if you are having trouble. Everyone who submits Assignment 1 on time will get 10/10 for this task.
Bonus task (Lesson 7)
Answer the questions at the end of the mini-lecture video.
In addition, I suggest the following reading with nothing to submit:
- Browse the R graph gallery to explore the huge variety of visualizations you can make with R. Practice thinking about how the language of aesthetic mappings, geometries, and scales can be used to describe these visualizations.
- Read through the ggplot cheatsheet to see how the concepts of the grammar of graphics will be connected to computer code. Don’t worry about the details – you will practice making visualizations using these tools over many future lessons.
Bonus task (Lesson 8)
Practice using ggplot to make graphs by reproducing examples from mini-lecture or notes, and modifying them by changing variables used in aesthetic mappings. If you would like, use other data sources described in “Data Sources” chapter in course notes.
Task 5 (Lesson 9)
Exercises to practice filter, group_by, summarize, and mutate.
I will add the repository for tasks 5 and 6 to your github account. Start by creating a new R project on your computer using the link to your repository you get from github. Edit the file task-5.rmd. When you are done, knit the file and commit the .rmd and .html files to your repository.
Task 6 (Lesson 10)
Exercises to practice making facetted plots. The questions are in the repository for tasks 5 and 6. Edit the file task-6.rmd. When you are done, knit the file and commit the .rmd and .html files to your repository. Push the changes to github to submit your work.
Bonus task (Lesson 11)
I will get you to practice reading files later on in the course. For now,
- pay attention to the code I use in future lessons for reading files, and
- download the excel file from the lesson, practice reading it into R, editing it in excel, and confirming that you can read the changes with R.
You will want to practice this a bit over reading week or just after when you are looking for data to be used in your term project.
Task 7 (Lesson 12)
Reshaping data practice. The tasks are in the repository for task 7 and 8.
Task 8 (Lesson 13)
Displaying tables practice. The tasks are in the repository for task 7 and 8.
Bonus task (Lesson 14)
The material in this lesson should be helpful if you run into challenges while working on Assignment 2, which asks you to develop new skills with unfamiliar functions.
Bonus task (Lesson 15)
Practice adding linear models and smooths to visualizations. Reproduce some of the examples from the course notes, mini-lecture, or course textbook. Create new visualizations of your own design by changing the model, data, or underlying visualization. Experiment with colors and facets.
Task 9 (Lesson 16)
Exercises on linear models. See the file task-9.rmd in the repository for task 9 and 10.
Task 10 (Lesson 17)
Exercises on LOESS and GAM smooths. See the file task-10.rmd in the repository for task 9 and 10.
Task 11 (Lesson 18)
Clone the GitHub project “team-planning” to your computer (or rstudio.cloud workspace). Edit the file “team-planning.Rmd” to add your name and GitHub user ID to the teams table. If you want to work with someone as a team, add both team mates to the same line. If you want to be assigned a random team mate, add your name on a line by yourself.
Bonus task (Lesson 19)
Look for some data on the internet. Download the data to your computer. Read the data into R. Make a summary table describing some part of the data. Make a visualization using some of the data. Can you find any formatting errors in the data? Did you have any trouble reading the data into R?
You can use any data you like for this task. If you want a specific suggestion, get some data from gapminder.org or another source in the lesson.
Task 12 (Lesson 20)
We’ve been using R markdown to make reproducible reports throughout the course. In this task, practice using here and chunk options.
This task is in the repository for task 12, 13, and 14.
Task 13 (Lesson 21)
Create a PCA and MDS plot as described in the task-13.rmd file. This task is in the repository for task 12, 13, and 14.
Task 14 (Lesson 23)
Create a K-means analysis and accompanying visualization as described in the task-14.rmd file. This task is in the repository for task 12, 13, and 14.
Bonus task (Lesson 24)
Create and modify an “R presentation” slide presentation as described in the lesson. Ensure you know how to
- show graphics output, but not the code that generated it,
- make the graphic the right size to fit on the slide,
- open the HTML version in your web browser.
Bonus task (Lesson 27)
Practice making a map from the lesson.
Task 15 (Lesson 28)
Make maps described in the task markdown file in the repository.
Task 16 (Lesson 29)
Show geographic data without maps as described in the repository.
Task 17 (Lesson 30)
Exercises with strings, factors, dates, and times.
Bonus task (Lesson 31)
Make a custom color scale using a web interactive tool and then use those colours on a plot. Select a palette from a list given in the lesson and use it in a visualization.
Task 18 (Lesson 32)
Learn to use theme elements as described in the repository.
That is the last task for the course!
Assignments are opportunities to apply and combine the skills from several lessons. They are both structured, in that you are asked to use specific skills to accomplish a task, and creative in that you have some flexibility in the product you produce. You will be assessed on your use of technical skills and your judgement in making well-designed and effective visualizations, following the principles explored in the course.
Assignments should be submitted to the relevant github repository, generally as an R markdown document.
Make a markdown document with a figure and commit to a github repository. Ensures you know the foundation of using git, github, R markdown, and ggplot to build new knowledge and skills on top.
Use the repository you created in Task 4.
Due Wednesday 27 January at 6:00 pm Atlantic time.
If, when you go to push your work back to github, you get an error about “Author identity unknown”, you need to type the following two commands into the R terminal window:
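The two commands are the standard git identity settings (the error message itself suggests them); substitute your own name and email:

```shell
git config --global user.email "you@example.com"
git config --global user.name "Your Name"
```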
There are so many packages and functions for making visualizations that it’s really important to be able to read documentation and learn new functions. Fortunately, the design of ggplot means that many functions work very similarly, so once you have learned the basics, it’s quite easy to learn more on your own.
The purpose of this assignment is for you to practice this sort of learning. I’ve picked out a few functions that work much like the examples we’ve looked at already. Your assignment is to pick out two of these and make an R markdown document describing how they work. This is the sort of practice I do all the time when I learn a new R skill.
Look in the repository assignment-2 for a template for this assignment.
Practice using methods developed in the course so far (summarizing, ggplot visualizations, linear regression, smooths) to explore a data set and answer questions about the data.
Tidy Tuesday is a weekly activity to support people learning to use R for data analysis and visualization. Each week a new dataset is posted and interested participants post their visualizations. Some are complex pieces of work by people with lots of experience, but many are the work of beginners just learning to make good visualizations. I encourage you to explore the datasets and example visualizations others have made as a source of ideas and inspiration.
For the next two assignments, I’ll select a few datasets and ask you to work with one of those for your assignment.
For this assignment you will make scatter plots with smooths (linear, loess, or gam) and dimensionality reduction (PCA or MDS). The goal is to gain some insight into the data and present some aspect of the data in a visually appealing way. You may be able to use the data as it's presented, or you may need to transform it in some way first (for example using the dplyr tools). You should feel free to show a subset of the data if you think that makes a better visualization to highlight a particular feature of the data.
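As a hedged, language-agnostic sketch of the two operations named above: the course work itself is done in R/ggplot, but the underlying math translates directly. Here it is in Python with NumPy, using synthetic data as a stand-in for a real dataset.

```python
# Hedged sketch only: this course's analyses are done in R/ggplot; the
# NumPy version below just illustrates the two operations.
import numpy as np

rng = np.random.default_rng(0)

# Linear smooth: fit y ~ x by least squares (geom_smooth(method = "lm") in R)
x = rng.uniform(0, 10, 50)
y = 2.0 * x + rng.normal(scale=1.0, size=50)
slope, intercept = np.polyfit(x, y, 1)   # line to overlay on the scatter

# PCA: project 4 numeric variables onto 2 components (prcomp() in R)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)                  # center each variable
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                   # 2-D coordinates for plotting
```

The `scores` array plays the role of the first two principal-component coordinates you would scatter-plot; the fitted `slope` and `intercept` define the smooth you would overlay.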
Present your work as a short R Markdown report. You should describe the dataset, explain any analysis or transformations you did, present at least 2 visualizations, and describe the main messages conveyed by your visualizations. Full instructions are in the repository.
This is the second Tidy Tuesday assignment. Create maps as described in the repository.
Organize your work as a slide presentation. A template with instructions is in the repository.
This is a peer evaluation assignment. You should provide, in separate documents as described in the repository,
- confidential feedback on your teammates' work for the term project
- a peer evaluation of two oral presentations from other teams, which will be shared with the presenters.
You should provide thoughtful and constructive feedback on the work of your classmates. Everyone’s work has good aspects and areas where there could be improvement; you should aim to provide useful feedback in both areas.
Your assignment for the peer evaluation is in the project planning repository we all share. Pull an updated version from GitHub to see your assignment. Remember to do this before the first day of presentations so you know which presentation to evaluate!
A rubric and guide for your evaluations is in the repository for this assignment.
Your final project is an analysis on a dataset of your own choosing. You can choose the data based on your interests, based on work in other courses, or independent research projects. You should demonstrate many of the techniques from the course, applying them as appropriate to develop and communicate insight into the data.
You should create compelling visualizations of your data. Pay attention to your presentation: neatness, coherency, and clarity will count. All analyses must be done in RStudio using R.
- Proposal - due Friday 12 March at 6:00 pm
- Presentation - due Thursday 1 April or Tuesday 6 April during synchronous meeting.
- Report - due Friday 9 April at 6:00 pm
Work in groups of 2 according to the assignments made in the team planning repository. You can produce team products, or one product per team member, whichever you prefer. You have two roles in the project. First you will contribute your original creative work for the project. Second, you will act as a collaborator, providing your teammate with feedback, suggestions, debugging help, proofreading and other assistance as requested.
Use a single GitHub repository for the proposal, presentation, and final report. Use the repository I created for your team. See the notes on collaboration with GitHub for guidance. Contact me if you have trouble.
Teams will be created in late February. I will add everyone to a repository called team-planning . This repository will be used to create teams, schedule presentations, and organize peer-evaluation for Assignment 6.
If you would like to form a team with someone else in the class, edit the rmd document to add your request. If you would like to be assigned a randomly selected team mate, there will be a way to indicate that too. This process will also introduce you to some collaboration features of github. The deadline for selecting teammates will be Monday 1 March. I will review this list and finalize the assignments on Tuesday 2 March. I will create the presentation schedule and peer-evaluation assignments later that week.
Your main task for the proposal is to find a dataset to analyze for your project and describe at least one question you can address with data visualizations.
It is important that you choose a readily accessible dataset that is large enough that multiple relationships can be explored, but not so complex that you get lost. I suggest your dataset should have at least 50 observations and about 10 variables. If you find a bigger dataset, you can make a subset to work with for your project. The dataset should include categorical and quantitative variables. If you plan to use a dataset that comes in a format we haven't encountered in class, make sure that you are able to load it into R, as this can be tricky depending on the source. If you are having trouble, ask for help before it is too late.
Do not reuse datasets from any part of the course.
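A quick, hedged way to sanity-check a candidate dataset against the size and variable-type guidelines above (the inline CSV is a stand-in for your real file, and the column names are made up):

```python
# Hedged sketch: check that a candidate dataset has enough observations
# and both categorical and quantitative variables. Data is illustrative.
import io
import pandas as pd

csv_text = "name,region,sales\n" + "\n".join(
    f"item{i},{'east' if i % 2 else 'west'},{i * 1.5}" for i in range(60)
)
df = pd.read_csv(io.StringIO(csv_text))

n_numeric = df.select_dtypes("number").shape[1]        # quantitative columns
n_categorical = df.select_dtypes(exclude="number").shape[1]  # categorical columns

big_enough = len(df) >= 50
mixed_types = n_numeric >= 1 and n_categorical >= 1
```

The same check in R would be a quick look at `dplyr::glimpse()` output; the point is to verify size and variable mix before committing to a dataset.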
Here is a list of data repositories containing many interesting datasets. Feel free to use data from other sources if you prefer.
- Awesome public datasets
- Bikeshare data portal
- Harvard Dataverse
- Statistics Canada
- Open government data: Canada , NS , and many other sources
- Other sources listed in the Data sources section of these notes
- Data you find on your own may be suitable too.
Describe a dataset and question you can address with the data for your proposal. Outline a plan to use five visualizations (e.g., data overview plot, dplyr/table summary, small multiples, smoothing/regression, k-means/PCA, map).
The repository contains a template for your proposal called proposal.rmd; write your proposal by revising this file.
- Questions: The introduction should introduce your research questions
- Data: Describe the data (where it came from, how it was collected, what are the cases, what are the variables, etc.). Place your data in the /data folder. Show that you can read the data and include the output of dplyr::glimpse() or skimr::skim() on your data in the proposal.
- The outcome (response, Y) and predictor (explanatory, X) variables you will use to answer your question.
- Ideas for at least two possible visualizations for exploratory data analysis, including some summary statistics and visualizations, along with some explanation on how they help you learn more about your data.
- An idea of how at least one statistical method described in the course (smoothing, PCA, k-means) could be useful in analyzing your data
- Team planning: briefly describe how members of your team will divide the tasks to be performed.
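To make the last method bullet concrete, here is a minimal, hedged sketch of one of the named methods (k-means, via Lloyd's algorithm) on synthetic data. The course itself would use R's kmeans(); this NumPy version only illustrates the idea.

```python
# Hedged sketch: minimal k-means (Lloyd's algorithm) in plain NumPy.
# Synthetic 2-D data with two well-separated clusters.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),    # cluster near (0, 0)
               rng.normal(5, 0.3, (30, 2))])   # cluster near (5, 5)

k = 2
centers = X[rng.choice(len(X), k, replace=False)]  # initialize from data points
for _ in range(20):
    # assign each point to its nearest center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # move each center to the mean of its assigned points
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
```

In a proposal you would explain what the cluster labels could reveal about your data (e.g., groups of similar observations worth coloring in a scatter plot).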
Assessment . See the file grade-proposal.rmd for the assessment guidelines and rubric.
The oral presentation should be about 5 minutes long. The goal is to present the highlights of your project and allow for feedback which can be incorporated as you revise your written report.
You should have a small number of slides to accompany your presentation. I have provided a template for you to use as presentation.rpres . I suggest a format such as the following:
- A title with team members’ names
- A description of the data you are analyzing
- At least one question you can investigate with your data visualization
- At least two data visualizations
- A conclusion
For suggestions on making slide presentations see the lesson on slides and recorded video.
Don’t show your R code; the focus should be on your results and visualizations not your computing. Set echo = FALSE to hide R code (this is already done in the template).
Your presentation should not just be an account of everything you tried (“then we did this, then we did this, etc.”), instead it should convey what choices you made, and why, and what you found.
Presentation schedule: Presentations will take place during the last two synchronous sessions of the course. You can choose to do your presentation live or pre-record it. You will watch presentations from other teams and provide feedback on one each day in the form of peer evaluations. The presentation schedule will be generated randomly.
Assessment . See the file grade-presentation.rmd for the assessment guidelines.
Practice your presentation, as a team, using the course collaborate room or other videoconferencing tool!
Follow the template provided for your written report (report.rmd) to present your visualizations and insights about your data. Review the marking guidelines in grade-report.rmd and ask questions if any of the expectations are unclear.
Style and format does count for this assignment, so please take the time to make sure everything looks good and your tables and visualizations are properly formatted.
You and your teammate will be using the same repository, so merge conflicts will happen, issues will arise, and that’s fine! Pull work from github before you start, commit your changes, and push often. Ask questions when stuck. Look at the lesson on collaboration for help.
In your knitted report your R code should be hidden ( echo = FALSE ) so that your document is neat and easy to read. However your document should include all your code such that if I re-knit your R Markdown file I will obtain the results you presented. If you want to highlight something specific about a piece of code, you’re welcome to show that portion.
General criteria for evaluation:
- Content - What is the quality of research question and relevancy of data to those questions?
- Correctness - Are visualization procedures carried out and explained correctly?
- Writing and Presentation - What is the quality of the visualizations, writing, and explanations?
- Creativity and Critical Thought - Is the project carefully thought out? Does it appear that time and effort went into the planning and implementation of the project?
R for Data Analysis and Visualization
ECON 396 (Fall 2017). TR 10:30-11:45, DURP computer lab (first floor, Saunders). Instructor: Jonathan Page.
Office hours: Monday 2-3 PM and Tuesday 3-4 PM, or by appointment, Saunders 509, jrpage at hawaii dot edu.
Student Learning Objectives
- To be familiar with standard techniques for visualizing data, including heat maps, contour plots, etc.
- To be able to transform raw data into formats suitable for analysis
- To be able to perform basic exploratory analysis
- To be able to create data visualizations in R
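As a hedged illustration of the first objective, here is one of the named techniques (a heat map) sketched in Matplotlib with synthetic data; the course itself builds equivalent plots in R.

```python
# Hedged sketch: a heat map with Matplotlib on synthetic data.
import numpy as np
import matplotlib
matplotlib.use("Agg")               # non-interactive backend
import matplotlib.pyplot as plt

z = np.random.default_rng(2).random((10, 12))   # a 10x12 grid of values

fig, ax = plt.subplots()
im = ax.imshow(z, cmap="viridis")   # each cell's color encodes its value
fig.colorbar(im, ax=ax, label="value")
ax.set_title("Synthetic heat map")
```

A contour plot of the same grid would swap `imshow` for `ax.contour(z)`; the data preparation is identical.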
There is no prerequisite for this course.
Introductory Statistics with Randomization and Simulation: Available as a free PDF ( https://www.openintro.org/stat/textbook.php?stat_book=isrs ) or for $8.49 on Amazon.
R Graphics Cookbook
RStudio Cheat Sheets
Grades for this course will be based on weekly assignments (30%), project assignments (30%), the project proposal (5%), the final project deliverable (20%), and final project presentation participation (15%).
Weekly assignments (30%)
Weekly assignments are short R exercises. Each exercise should take no longer than 15 minutes. You will typically be given time to complete the exercise in class the day the assignment is given. The assignment will be in the form of an R Markdown file (*.Rmd). You will submit the completed assignments via classroom.google.com by the following class period.
Project assignments (30%).
Each week, leading up to the project proposal, you will be given an assignment that is designed to provide you with an organized workflow for approaching new data science projects. Project assignments are submitted via classroom.google.com, with the exception of the two presentations.
Project proposal presentation (5%)
This presentation should be less than 2 minutes. You simply need to communicate the core question your project seeks to answer and the dataset(s) you will be using to answer this question.
Final project (20%)
The final project will be an R Markdown document which communicates your project question, the data you used, and your results. You will need to deliver both your R Markdown file and any necessary data for running the file.
Final project presentation participation (15%)
Your final project participation grade is based on a combination of your own presentation and the feedback you provide to your classmates.
The following schedule is tentative and subject to change. Typically, the Tuesday class will consist of the week’s R lecture. Depending on how quickly we get through the material, you will have time to work on your assignment that will be due before the following class period. On Thursdays, we will discuss a relevant topic, but you should have time to work on your project assignment for the week. That assignment will generally be due before the following class period, except for the last several weeks when you are completing your final project.
- Data R Sample Datasets
- Topic Data sources overview
- Project Assignment Identify interesting datasets (include links to datasets) and questions
- Data ACS PUMS [CSV]
- Topic Anscombe’s Quartet
- Project Assignment Choose question and dataset (with link to your source) for your project
- Data Hawaii Tourism Authority [Excel]
- Topic Probability
- Project Assignment Write description of your question
- Data State of Hawaii Department of Business, Economic Development (DBEDT) [Excel]
- Topic Distributions
- Project Assignment Write description of your dataset(s)
- Data ACS Immigration [CSV]
- Topic JunkCharts Trifecta Checkup
- Project Assignment Create 2 descriptive plots of your dataset(s)
- Data SSA [Excel]
- Topic Intro to Inference
- Project Assignment Write a description of the data cleaning required for your project
- Data NOAA Wind [netCDF]
- Topic Confidence Interval
- Project Assignment Write a description of your planned approach
- Data BLS American Time Use Survey (ATUS) [TSV]
- Topic Project Proposal Description
- Project Assignment Work on project proposal presentation
- Data PSID [SPS, TXT (Fixed-Width)]
- Topic Present project proposal (<2 Minutes)
- Data Zillow Age of Real Estate Inventory Data [CSV]
- Topic Inference for Numerical Data
- Project Assignment Work on final project
- Data University of Michigan - Survey of Consumers [CSV]
- Topic Inference for Categorical Data
- Project Assignment Work on final project (cont.)
- Data Twitter [twitteR API]
- Topic Linear Regression
- Data UN Comtrade [CSV]
- Topic Multiple Regression
- Data IRS Statistics of Income [CSV]
- Topic Putting your work online
- Data NSF National Survey of College Graduates [DAT (Fixed-Width)]
- R Time-Series Modeling
- Final Project presentations
There are many useful resources you should be aware of while going through this course. I will attempt to keep this list updated as I become aware of more useful links:
RStudio’s List of Useful R Packages
Visual Tutorial on Histograms
R for Data Science - Grolemund and Wickham
Catalog of Visualization Types by Ferdio
Gary King - Quantitative Research Methodology
John Stasko - Information Visualization
Jenny Bryan - Data wrangling, exploration, and analysis with R
HarvardX Biomedical Data Science Open Online Training
Econometrics in R
Using R for Data Analysis and Graphics
You made it to the end of our whirlwind tour of data visualization principles! Congratulations!
Now you get to show off all the tools you learned with a beautiful, truthful, narrative visualization.
For your final project, you will take a dataset, explore it, tinker with it, and tell a nuanced story about it using at least three graphs.
I want this project to be as useful for you and your future career as possible—you’ll hopefully want to show off your final project in a portfolio or during job interviews.
Accordingly, you have some choice in what data you can use for this project. I've found several different high-quality datasets online related to the core MPA/MPP tracks. You do not have to choose a dataset in your given field (especially if you're not an MPA or MPP student!). Choose whichever one you are most interested in or will have the most fun with.
Data from the internet
Go to this list of data sources and find something interesting! The things in the “Data is Plural” newsletter are often especially interesting and fun. Here are some different high-quality datasets that students have worked with before:
- U.S. Charities and Non-profits : All of the charities and nonprofits registered with the IRS. This is actually split into six separate files. You can combine them all into one massive national database with bind_rows() , or filter the data to include specific states (or a single state). It all depends on the story you’re telling. Source: IRS.
- Nonprofit Grants 2010 to 2016 : Nonprofit grants made in the US as listed in Schedule I of the IRS 990 tax form between 2010 to 2016. Source: IRS.
Federal, state, and local government management
- Deadly traffic accidents in the UK (2015) : List of all traffic-related deaths in the UK in 2015. Source: data.gov.uk .
- Firefighter Fatalities in the United States : Name, rank, and cause of death for all firefighters killed since 2000. Source: FEMA.
- Federal Emergencies and Disasters, 1953–Present : Every federal emergency or disaster declared by the President of the United States since 1953. Source: FEMA.
- Global Terrorism Database (1970–2016) : 170,000 terrorist attacks worldwide, 1970-2016. Source: National Consortium for the Study of Terrorism and Responses to Terrorism (START), University of Maryland.
- City of Austin 311 Unified Data : All 311 calls to the City of Austin since 2014. Source: City of Austin.
- 515K Hotel Reviews Data in Europe : 515,000 customer reviews and scoring of 1,493 luxury hotels across Europe. Source: Booking.com.
- Chase Bank Branch Deposits, 2010–2016 : Records for every branch of Chase Bank in the United States. This dataset is not quite tidy and will require a little bit of reshaping with gather() or pivot_longer() , since there are individual columns of deposits per year. Source: Chase Bank.
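The reshaping flagged for the Chase dataset (gathering per-year deposit columns into long form) is gather()/pivot_longer() in R; as a hedged sketch, the equivalent in pandas is melt(). The column names below are illustrative placeholders, not the real dataset's.

```python
# Hedged sketch: wide-to-long reshape with pandas.melt().
# Column names are illustrative, not the actual dataset's.
import pandas as pd

wide = pd.DataFrame({
    "branch": ["A", "B"],
    "deposits_2010": [100, 200],
    "deposits_2011": [110, 190],
})

# one row per (branch, year) instead of one deposit column per year
long = wide.melt(id_vars="branch", var_name="year", value_name="deposits")
long["year"] = long["year"].str.replace("deposits_", "", regex=False).astype(int)
```

After this reshape, grouping or plotting by year is a single operation rather than a loop over columns.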
Here’s what you’ll need to do:
Download a dataset and explore it. Many of these datasets are large and will not open (well) in Excel, so you’ll need to load the CSV file into R with read_csv() . Most of these datasets have nice categorical variables that you can use for grouping and summarizing, and many have time components too, so you can look at trends. Your past problem sets and in-class examples will come in handy here.
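The workflow in R uses read_csv() and dplyr; as a hedged sketch, the same load-group-summarize pattern in pandas looks like this (io.StringIO stands in for a downloaded file, and the columns are made up):

```python
# Hedged sketch: load a CSV and group/summarize it.
# The inline CSV and column names are illustrative stand-ins.
import io
import pandas as pd

csv_text = """state,year,count
TX,2014,10
TX,2015,15
CA,2014,20
CA,2015,25
"""
df = pd.read_csv(io.StringIO(csv_text))

by_state = df.groupby("state", as_index=False)["count"].sum()   # totals per category
by_year = df.groupby("year", as_index=False)["count"].mean()    # trend over time
```

Summaries like `by_state` and `by_year` are usually the inputs to your plots, so getting comfortable with this step pays off quickly.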
Find a story in the data. Explore that story and make sure it’s true and insightful.
Use R to create multiple graphs to tell the story. You can make as many graphs as you want, but you must use at least three different chart types (i.e. don’t just make three scatterplots or three maps).
Export these figures as PDF files, place them in Adobe Illustrator (or InDesign or Gravit Designer or Inkscape), and make one combined graphic or handout where you tell the complete story. You have a lot of latitude in how you do this. You can make a graphic-heavy one-page handout. You can make something along the lines of this, with one big graphic + smaller subgraphics + explanatory text. Just don't make a goofy infographic. Whatever you do, the final figure must include all the graphics, must have some explanatory text to help summarize the narrative, and must be well designed.
Export the final graphic from Illustrator as a PDF and a PNG.
Write a memo using R Markdown to introduce, frame, and describe your story and figure. Use this template to get started . You should include the following in the memo:
- Executive summary
- Background information and summary of the data
- Explanation, description, and code for each individual figure
- Explanation and description for the final figure
- Final figure should be included as an image (remember ![Caption goes here](path/to/file.png) )
Remember to follow R Markdown etiquette rules and style—don’t have it output extraneous messages or warnings, include summary tables in nice tables, adjust the dimensions for your figures, and remove the placeholder text that’s in the template already (i.e. I don’t want to see stuff like “Describe and show how you cleaned and reshaped the data” in the final report.)
You should download a full example of what a final project might look like (but don’t make your final combined visualization look exactly like this—show some creativity!)
Upload the following files to iCollege:
- A memo introducing and describing your final graphic (see full instructions above)
- A standalone PDF of your graphic exported from Illustrator
- A standalone PNG of your graphic exported from Illustrator
No late work will be accepted for this project since it’s the last project and it counts as your final.
I will use this rubric to grade the final product:
I am happy to give feedback and help along the way—please don’t hesitate to get help! My goal is for you to have a beautiful graphic in the end that you’ll want to show off to all your friends, family, neighbors, employers, and strangers on the street—I’m not trying to trip you up or give you trick questions!
And that’s it. You’re done! Go out into the world now and make beautiful, insightful, and truthful graphics.
Go forth and make awesomeness.
Here are some great examples of student projects from past versions of this class.
Travel runs in Yellowstone
Project description Final PDF Final PNG
Scripture use by The Killers
Last updated on July 31, 2021
Assignment 6 « Data Visualization »
Due: 11:59 PM March 27, 2021 (ET).
Bad Data Viz can be found everywhere. So, please let us indulge by listing just a few examples here:
- Due date: 11:59pm 3/23/2021
- Stencil: Github Link (Github instructions can be found here )
- Handin: Gradescope Link
- Files to submit: dashboard.html , main.js , main.css , README.txt
To illustrate the capabilities of D3, here are some stunning examples of visualizations created with it:
- Every line in Hamilton, visualized
- Life Expectancy
- Tokyo Wind Speeds
- Koala to the Max
We know that D3 code is easy to find online. Even if we provide you with reference implementations, do not copy any code.
Before you begin
Do the intro to viz lab.
- Start early!
- Do/review the D3 lab and refer to the solution code for part 2 for certain features (e.g., colors, tooltips)
- Look for resources online (DO NOT COPY!) and do your best
Accessible Data Visualization
- “Why Accessibility Is at the Heart of Data Visualization,” paying particular attention to the “Design equivalent experiences for all your readers” section.
- “A Comprehensive Guide to Accessible Data Visualization,” which provides specific suggestions to make data visualizations accessible to people with visual impairments.
We hope that for some of you this will be a fun assignment that closely resembles future data science work!
Getting the Stencil
You can click here to get the stencil code for this assignment. Reference this guide for more information about GitHub and GitHub Classroom.
The only stencil requirement that we have is that your code must consist of index.html , main.js , main.css , and relevant files. Please refer to the submission section of this handout for submission specifications.
To view your visualization in the browser, we're going to load the webpage via a local web server. Navigate to the directory containing your index.html file, and run python3 -m http.server 8000
You then can open a browser to the url http://localhost:8000/index.html to view your dashboard.
If you are using the department machine to code and run your Python server, you can refer to these instructions to learn how to view your product locally.
Your task is to create your own informative D3 dashboard! This assignment, in particular, is very flexible. There are no strictly correct or incorrect answers since visualization is inherently subjective. That said, we will evaluate your work on a number of requirements listed below. Further, we expect that you incorporate concepts that Lorenzo has discussed in class like color palettes, font types/sizing, orientation, clarity, organization, and informativeness.
(5 points! Just for using the right data...)
You must use data from one of the following datasets. Each dataset has a series of leading questions you may use as inspiration. You should make sure that your dashboard answers these questions. Imagine your boss gives you 2 weeks to build a dashboard on these issues.
Note: All 3 datasets are provided in the data directory. Each topic includes a link to a Kaggle site with more information about the dataset.
- Your boss wants to know the top 10 video games of all time or top 10 for a specific year.
- Your boss wants to understand which genre is most popular. We'd like to see genre sales broken out per region. (This question can be answered by showing the top genre in each region if you want to implement a map, otherwise you should show genre sales broken down by region in bar/scatter/line/pie etc.)
- Lastly, your boss wants to know which publisher to pick based on a game's genre. Your chart should provide a clear top publisher for each genre (this can be interactive or shown statically).
- Your boss wants to know the number of football games by year. You should show at minimum 5 years, but you can choose which years to show.
- Your boss wants to understand the top winning nations. We would like to see a winning percentage for the top 10 nations. You can show this in a map form if you would like to.
- Lastly, we are trying to bet on which team will win the 2022 World Cup. Over the last 2 World Cups, which teams were top performing? You can decide how to interpret "top performing". A few approaches we would recommend: winning percentage in the World Cup, victory strength, strength of opponent. You may show any combination of those. We don't have a specific answer we expect, and you should explain your choice in the written questions.
- Your boss wants to know the number of titles per genre on Netflix.
- Your boss wants to understand the average runtime of movies by release year.
- Lastly, we want to learn about the cast and directors. You have two choices here: 1) the top director + actor pairs by number of movies made 2) a flow chart where each actor is a node, and a link refers to a movie they both acted in (just the connection, no need to specify number of movies made together or which movies those are)
- You must have 3 graphs (10 points each)
Unique graphs are bar, line, scatter plot, heatmap, density, area, etc. You can see examples here. In our stencil we have set up three boxes for you to place your graphs, but you can feel free to adjust these based on your graph selection.
We define this as a clickable/writeable element that updates the graphs/look of the dashboard.
We define this as information which appears when users hover over a data point; this can be combined with your clickable element, but a tooltip alone is not considered interactive.
- Make sure to write graph/dashboard titles/units/axis/tooltips where appropriate.
- Your dashboard should properly use color. If it's all black, that's bad. If it's a rainbow, that's bad. This is also an area to consider the accessibility of your dashboard. Picking color schemes that are colorblind friendly is a great habit and something that is very easy. Online tools (e.g. https://coolors.co/) can help show whether your colors are colorblind friendly. This is not a requirement, but if your colors do not complement your visualization (distracting, unnecessary, confusing) we will deduct points.
- You use the provided data (5 points)
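One easy way to act on the color advice above is to start from a palette already designed for colorblind safety. The Okabe-Ito palette is a widely cited example, hardcoded here as hex strings so it can be dropped into D3 (e.g., as the range of d3.scaleOrdinal) or any plotting tool:

```python
# The Okabe-Ito colorblind-friendly palette as plain hex color strings.
okabe_ito = [
    "#000000",  # black
    "#E69F00",  # orange
    "#56B4E9",  # sky blue
    "#009E73",  # bluish green
    "#F0E442",  # yellow
    "#0072B2",  # blue
    "#D55E00",  # vermillion
    "#CC79A7",  # reddish purple
]

# sanity check: every entry is a well-formed "#RRGGBB" string
ok = all(c.startswith("#") and len(c) == 7 for c in okabe_ito)
```

You can still run your finished dashboard through the Colorblindness Simulator mentioned earlier to confirm the result.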
We will evaluate your graph communication two-fold. Each graph will be assessed on 1) how well it communicates its information and 2) how well it answers the question.
(up to +30 points).
You implement some form of dynamic stats calculation. By dynamic, we mean that it updates depending on which data is being shown. Our example dashboard calculates a regression line, but you could show box plot whiskers, calculate percentiles, or run a t-test.
If you choose to add this, please add a note below your dashboard and written answers that explains, in 1-3 sentences, what you did and why it's statistical. (10 points)
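As a hedged sketch of the math behind one such dynamic stat, here are box-plot quartiles and whiskers recomputed over whatever subset the dashboard currently shows; in the dashboard itself this logic would live in JavaScript (e.g., via jStat):

```python
# Hedged sketch: box-plot stats for the currently filtered data.
# In the dashboard this would run in JavaScript on each filter change.
import numpy as np

shown = np.array([3, 7, 8, 5, 12, 14, 21, 13, 18])  # currently visible subset

q1, median, q3 = np.percentile(shown, [25, 50, 75])
iqr = q3 - q1
lower_whisker = shown[shown >= q1 - 1.5 * iqr].min()  # Tukey whisker rule
upper_whisker = shown[shown <= q3 + 1.5 * iqr].max()
```

Because the five numbers are recomputed from `shown`, the displayed box plot updates whenever the user's filter changes the visible data, which is exactly what "dynamic" means here.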
You may attempt one or both of these and get up to an additional 20 points!
For some of the provided data sets and questions, we ask about geographic impact. One way to show this is with a bar/scatter/line graph. Another possibility is to show this on a geographic map! If one of your graphs is a "Map type" (https://www.d3-graph-gallery.com/) you earn an additional 10 points .
For other data sets we discuss the relationship between particular data points, particularly as a graph (e.g., how many hops between 2 actors on Netflix). You can use D3 to make and visualize this graph! If one of your graphs is a "Flow type" (https://www.d3-graph-gallery.com/) you earn an additional 10 points.
If you have any questions about whether a particular chart qualifies for either of these, please ask on piazza.
We have an example D3 dashboard here, created by CS1951A Spring 2020's Arvind Yalavarti, using the TA dataset from the D3 lab. Our dashboard goes above and beyond our expectations for you, but we thought it would be helpful for reference, especially for those of you who want to go above and beyond.
Use of External Libraries
We've already included D3 and Bootstrap in the stencil code provided. To perform statistical calculations take a look at jStat and d3-regression .
If you would like to use either of these libraries, add the following lines to your index.html file:
Write answers to the following questions below your dashboard in index.html :
- Describe how your dashboard answers the questions presented. You don't have to address every question directly, but should at a high level address the main questions. (10 points)
- List 3 reasons why D3 was helpful and improved your visualization (6 points)
- List 3 reasons why D3 would not be the best tool for creating a visualization (6 points)
Accessible Data Visualization (10 points)
- Evaluate the accessibility of your dashboard based on the readings in the “Before you begin” section. What kinds of users might find this dashboard accessible and who might have more difficulty? What additional actions might you take to make this dashboard more accessible to all audiences? Your response should refer to at least one of the readings and be about 1 paragraph.
- Reflect on the stages of your design and implementation process when you could have taken steps to make your dashboard more accessible to all audiences. What are some factors that kept you from taking these steps? (a few sentences)
- If you’re interested in more considerations surrounding the social impact of data visualizations, check out the “Design Process Questions” and “Design Output Questions” for each principle in Section 3 of "Feminist Data Visualization."
- If you’re interested in the relationship between data analysis, visualization, and COVID-19 misinformation, check out “The Data Visualizations Behind COVID-19 Skepticism.”
After finishing the assignment, run python3 zip_assignment.py in the command line from your assignment directory, and fix any issues brought up by the script. It is crucial that you use the script provided to zip up the assignment. Our script generates an anonymous link which will be used to grade your assignment, and pushes your code to your repository at the moment of zipping to make sure that whatever you submit to Gradescope is the same as what we see when grading your code.
This zipping script uses some Python packages that have not yet been installed in the virtual environment. If you run into "No module named ...", use python3 -m pip install [MODULE-NAME-HERE] to fix the problem and zip up the assignment.
Somewhere in our zipping script, we call git commands ( git add . , git commit , and git push ) to push your code to your Github repository. The script should notify you if these actions fail: in which case, please manually push all your work onto your Github repository in addition to submitting your zip file ( dataviz-submission-1951A.zip ) onto Gradescope.
After the late submission deadline, we will send out instructions for everyone to set up the Github Pages site of their assignment, so your work can be seen by us (and the public). This is crucial for us to grade your assignment, and we expect everyone to set up their Github Page within 48 hours of our notification (by March 20th, 11:59PM ET ).
Please do not make your Github repository public or set up your Github Pages site before the late submission deadline has passed. Doing so will be treated as a violation of the collaboration policy, as your code will be visible to other members of the course.
Please make sure that the code you submitted on Gradescope is the same as the code on your Github repository (we have tried to ensure this by calling git commands in zip_assignment.py ). During grading, we check that your files are identical to make sure that no changes have been made after your submission.
Make sure your code runs on your machine before handing in.
Make sure you're using the correct relative path when referencing any of the js/html files in your handin. There will be a deduction if there are problems viewing your visualization due to incorrect file paths.
This assignment was created in Spring 2020 by Arvind Yalavarti (ayalava2) and Joshua Levin (jlevin1), updated by Tiffany Nguyen (tnguye72), Neal Mahajan (nmahaja1), and Nam Do (ndo3) in Spring 2021. The accessibility component was added by Lena Cohen, Evan Dong, and Gaurav Sharma.
Data Analysis and Visualization in R (IN2339)
A practical introduction to data science.
Chair of Computational Molecular Medicine, Technical University of Munich
This is the lecture script of the module Data Analysis and Visualization in R (IN2339).
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
In today’s assignment you will practice the principles of data visualization that were covered in the lecture. Visualizations should accurately and concisely convey a message in an easily understandable way without misleading the audience. Avoid extraneous visual elements.
You will start with some more directed visualizations of the GTEx RNA-seq dataset you analyzed last week, followed by a more open-ended project.
Re-load the data you downloaded in the previous lab session. These data comprise RNA-seq samples from whole blood from the GTEx Consortium (755 total individuals). As a reminder, they were downloaded directly from the GTEx portal and slightly reformatted to save you some time on tedious data wrangling. If needed, download again from the Dropbox links below:
- Subject-level metadata
- Gene expression matrix
Exercise 1: Visualize characteristics of the RNA-seq data
In the first exercise, you’ll be exploring some aspects of the GTEx whole blood data and generating plots that communicate the observed patterns in an easy-to-interpret manner. You will be producing four figures. Each should be saved as its own separate file and uploaded as part of the assignment.
For all figures, label the axes appropriately, provide legends only when necessary, and do not place a title or any other elements on the plot itself. You will be graded on proper labelling.
Create a plotting_exercise1.py script now for this exercise.
Step 1.0 : Load the data
In your plotting_exercise1.py script, load and normalize the GTEx data using the code below:
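The stencil's loading code is not reproduced here. As a rough sketch, assuming the normalization is counts-per-million followed by a log2 transform (an assumption, since the stencil may normalize differently), and using a tiny synthetic matrix in place of the real files:

```python
import numpy as np
import pandas as pd

def log_cpm(counts: pd.DataFrame) -> pd.DataFrame:
    """Normalize a genes-by-samples count matrix to counts-per-million,
    then log2-transform with a pseudocount of 1."""
    cpm = counts / counts.sum(axis=0) * 1e6
    return np.log2(cpm + 1)

# With the real data you would first read the downloaded files, e.g.
#   counts = pd.read_csv("gene_expression.csv", index_col=0)  # hypothetical filename
# Here a tiny synthetic matrix stands in for the GTEx counts:
counts = pd.DataFrame(
    {"GTEX-A": [10, 90], "GTEX-B": [50, 50]},
    index=["gene1", "gene2"],
)
logged = log_cpm(counts)
```

Dividing a DataFrame by a Series aligns the Series index against the DataFrame's columns, which is why the per-sample library-size division works column-wise here.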
Step 1.1 : Distribution of expression across genes
For subject GTEX-113JC, plot the distribution of expression (logged normalized counts) across all genes, excluding any genes with 0 counts. Upload this figure for the assignment.
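A histogram is one natural choice for this step. The sketch below uses a synthetic Series in place of the subject's real column (with real data this would be something like counts["GTEX-113JC"]), and assumes the same CPM/log2 normalization as above:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Synthetic stand-in for one subject's raw counts across genes.
rng = np.random.default_rng(0)
subject = pd.Series(rng.poisson(20, size=1000))
subject.iloc[:100] = 0                      # some genes with zero counts

nonzero = subject[subject > 0]              # exclude genes with 0 counts
logged = np.log2(nonzero / nonzero.sum() * 1e6 + 1)

fig, ax = plt.subplots()
ax.hist(logged, bins=30)
ax.set_xlabel("log2(CPM + 1)")
ax.set_ylabel("Number of genes")
fig.savefig("step1_1_distribution.png")
```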
Step 1.2 : Expression of a single gene between sexes
For the gene MXD4, plot the distribution of expression (logged normalized counts) in males versus females. Upload this figure for the assignment.
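One reasonable way to compare the two distributions is a box plot per sex. The sketch below uses synthetic values; with the real data, the expression values would come from the normalized matrix row for MXD4 and the sex labels from the subject-level metadata:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Synthetic stand-in: one gene's logged expression plus each subject's sex.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "expr": np.concatenate([rng.normal(5, 1, 60), rng.normal(5.5, 1, 40)]),
    "sex": ["male"] * 60 + ["female"] * 40,
})

groups = [df.loc[df["sex"] == s, "expr"] for s in ["male", "female"]]
fig, ax = plt.subplots()
ax.boxplot(groups)
ax.set_xticks([1, 2])
ax.set_xticklabels(["male", "female"])
ax.set_ylabel("MXD4 expression, log2(CPM + 1)")
fig.savefig("step1_2_mxd4_by_sex.png")
```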
Step 1.3 : Distribution of subject ages
Plot the number of subjects in each age category. Upload this figure for the assignment.
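Since GTEx reports ages as decade brackets, a bar chart of counts per bracket fits here. A sketch with a synthetic stand-in for the metadata's age column:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Synthetic stand-in for the subject metadata's age-bracket column.
ages = pd.Series(["20-29", "30-39", "30-39", "50-59", "50-59", "50-59"])
counts_per_age = ages.value_counts().sort_index()

fig, ax = plt.subplots()
ax.bar(counts_per_age.index, counts_per_age.values)
ax.set_xlabel("Age bracket")
ax.set_ylabel("Number of subjects")
fig.savefig("step1_3_age_distribution.png")
```

Sorting by index (rather than by count) keeps the brackets in age order on the x axis.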
Step 1.4 : Sex-stratified expression with age
For the gene LPXN, plot the median expression (logged normalized counts) over time (i.e. in each age category), stratified by sex. Upload this figure for the assignment.
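One line per sex, with the age bracket on the x axis, communicates this well. The sketch below computes the per-group medians with a pivot table over a synthetic stand-in table (with real data, expression would come from the LPXN row and age/sex from the metadata):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Synthetic stand-in: LPXN expression, age bracket, and sex per subject.
df = pd.DataFrame({
    "expr": [4.0, 4.2, 5.0, 5.1, 4.5, 4.4, 5.6, 5.8],
    "age":  ["20-29", "20-29", "20-29", "20-29", "30-39", "30-39", "30-39", "30-39"],
    "sex":  ["male", "male", "female", "female", "male", "male", "female", "female"],
})

# Median expression per (age bracket, sex): one column per sex.
medians = df.pivot_table(index="age", columns="sex", values="expr", aggfunc="median")

fig, ax = plt.subplots()
for sex in medians.columns:
    ax.plot(medians.index, medians[sex], marker="o", label=sex)
ax.set_xlabel("Age bracket")
ax.set_ylabel("Median LPXN expression, log2(CPM + 1)")
ax.legend(title="Sex")
fig.savefig("step1_4_lpxn_by_age_sex.png")
```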
Exercise 2: Independent data visualization
We highly encourage you to do this exercise with a partner.
With your partner, select any data set from the TidyTuesday repository on GitHub: https://github.com/rfordatascience/tidytuesday . In this exercise, you’ll be exploring interesting patterns in the data set that you choose.
Create a plotting_exercise2.py script now for this exercise.
Step 2.1 : Initial exploration
Using Python ( pandas , numpy , matplotlib , etc.), explore these data with your partner, searching for any interesting features or patterns. Jot down any interesting patterns you observe as notes (no need to submit). For each feature/pattern you observe, think about what kind of plot would best communicate that feature/pattern.
Step 2.2 : Visualization
Choose three aspects/patterns of these data that are best represented by three different types of plots (e.g., histogram, bar plot, line plot, heatmap, etc.).
Generate these figures using matplotlib . As always, label the axes appropriately and avoid extraneous visual elements. For each plot, add a title that concisely states the message that your figure is attempting to convey.
For this assignment you should submit the following:
- Code to load in and normalize data (1 point)
- Code to create plot for Step 1.1 (0.5 point)
- Code to create plot for Step 1.2 (0.5 point)
- Code to create plot for Step 1.3 (0.5 point)
- Code to create plot for Step 1.4 (0.5 point)
- Code to load in data (1 point)
- Code for initial data exploration (1 point)
- Code to produce first plot (0.5 point)
- Code to produce second plot (0.5 point)
- Code to produce third plot (0.5 point)
- Plot for Step 1.1 (0.5 point)
- Plot for Step 1.2 (0.5 point)
- Plot for Step 1.3 (0.5 point)
- Plot for Step 1.4 (0.5 point)
- First plot for Exercise 2 (0.5 point)
- Second plot for Exercise 2 (0.5 point)
- Third plot for Exercise 2 (0.5 point)
Total Points: 10
DO NOT push any raw data! Only the things we asked for!
Advanced Exercise 1
For each age range, plot the proportion of samples within each group of the Hardy scale. This should be a single panel figure.
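A stacked bar chart is one way to fit all age ranges in a single panel. A sketch, using pd.crosstab with normalize="index" to get within-bracket proportions over a synthetic stand-in for the Hardy scale and age columns:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# Synthetic stand-in: Hardy scale group and age bracket per sample.
df = pd.DataFrame({
    "age":   ["20-29", "20-29", "20-29", "50-59", "50-59", "50-59"],
    "hardy": [0, 0, 4, 2, 4, 4],
})

# Proportion of samples in each Hardy group, within each age bracket.
props = pd.crosstab(df["age"], df["hardy"], normalize="index")

# Stacked bars: each bar sums to 1, one bar per age bracket.
fig, ax = plt.subplots()
props.plot(kind="bar", stacked=True, ax=ax)
ax.set_xlabel("Age bracket")
ax.set_ylabel("Proportion of samples")
ax.legend(title="Hardy scale")
fig.savefig("hardy_by_age.png")
```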
Advanced Exercise 2
We’d like to see if we can identify any broad patterns present in our gene expression data. To explore this, we’re going to cluster the data, both by sample as well as by gene.
To perform clustering, you’ll be using the dendrogram , linkage and leaves_list functions from scipy . The documentation for SciPy isn’t very helpful, but with some quick Googling you can find examples of how to use these functions.
Using linkage and leaves_list , cluster the filtered and log2 transformed gene expression data matrix for both genes and samples based on their patterns of expression (so both the rows and columns of the matrix). You will find the numpy transpose functionality useful in order to cluster the columns.
Plot a heatmap of the clustered gene expression data.
Using the dendrogram function, create a dendrogram relating the samples to one another.
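The three steps above can be sketched as follows, using a random matrix in place of the filtered, log2-transformed expression data (the "average" linkage method is an arbitrary choice here; the exercise does not prescribe one):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, leaves_list

# Synthetic genes-by-samples matrix standing in for the expression data.
rng = np.random.default_rng(42)
expr = rng.normal(size=(30, 10))  # 30 genes x 10 samples

# Cluster rows (genes) and columns (samples); transpose to cluster columns.
gene_link = linkage(expr, method="average")
sample_link = linkage(expr.T, method="average")
gene_order = leaves_list(gene_link)
sample_order = leaves_list(sample_link)

# Heatmap of the matrix reordered by the clustering.
fig, ax = plt.subplots()
ax.imshow(expr[np.ix_(gene_order, sample_order)], aspect="auto")
ax.set_xlabel("Samples (clustered)")
ax.set_ylabel("Genes (clustered)")
fig.savefig("clustered_heatmap.png")

# Dendrogram relating the samples to one another.
fig, ax = plt.subplots()
dendrogram(sample_link, ax=ax)
fig.savefig("sample_dendrogram.png")
```

leaves_list returns the leaf ordering implied by the linkage, which is what lets the heatmap place similar rows and columns next to each other.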
- Fundamentals of Data Visualization Textbook
R Programming for Biologists
Teaching the tools to get computers to do cool science
Following this assignment students should be able to:
- understand the basic plot function of ggplot2
- import ‘messy’ data with missing values and extra lines
- execute and visualize a regression analysis
- R for Data Science - Data visualisation
- ggplot2 documentation
Mass vs Metabolism (20 pts)
The relationship between the body size of an organism and its metabolic rate is one of the most well studied and still most controversial areas of organismal physiology. We want to graph this relationship in the Artiodactyla using a subset of data from a large compilation of body size data (Savage et al. 2004). You can copy and paste this data frame into your program:
Make the following plots with appropriate axis labels:
- A plot of body mass vs. metabolic rate
- A plot of body mass vs. metabolic rate, with logarithmically scaled axes (this stretches the axis, but keeps the numbers on the original scale), and the point size set to 3.
- The same plot as (2), but with the different families indicated using color.
- The same plot as (2), but with the different families each in their own subplot.
Adult vs Newborn Size (25 pts)
It makes sense that larger organisms have larger offspring, but what the mathematical form of this relationship should be is unclear. Let’s look at the problem empirically for mammals.
Download some mammal life history data from the web. You can do this either directly in the program using read.csv() or download the file to your computer using your browser, save it in the data subdirectory, and import it from there. It is tab delimited so you’ll want to use sep = "\t" as an optional argument when calling read.csv() . The \t is how we indicate a tab character to R (and most other programming languages).
When you import the data there are some extra blank lines at the end of this file. Get rid of them by using the optional read.csv() argument nrows = 1440 to select the valid 1440 rows.
Missing data in this file is specified by -999 and -999.00 . Tell R that these are null values using the optional read.csv() argument, na.strings = c("-999", "-999.00") . This will stop them from being plotted.
- Graph adult mass vs. newborn mass. Label the axes with clearer labels than the column names.
- It looks like there’s a regular pattern here, but it’s definitely not linear. Let’s see if log-transformation straightens it out. Graph adult mass vs. newborn mass, with both axes scaled logarithmically. Label the axes.
- This looks like a pretty regular pattern, so you wonder if it varies among different groups. Graph adult mass vs. newborn mass, with both axes scaled logarithmically, and the data points colored by order. Label the axes.
- Coloring the points was useful, but there are a lot of points and it’s kind of hard to see what’s going on with all of the orders. Use facet_wrap to create a subplot for each order.
- Now let’s visualize the relationships between the variables using a simple linear model. Create a new graph like your faceted plot, but using geom_smooth to fit a linear model to each order. You can do this using the optional argument method = "lm" in geom_smooth .
Sexual Dimorphism Exploration (25 pts)
You are interested in understanding whether sexual size dimorphism is a general pattern in birds.
Download and import a large publicly available dataset of bird size measures created by Lislevand et al. 2007 .
Import the data into R. It is tab delimited so you’ll want to use sep = "\t" as an optional argument when calling read.csv() . The \t is how we indicate a tab character to R (and most other programming languages).
Using ggplot :
- Create a histogram of female masses (they are in the F_mass column). Change the x axis label to "Female Mass(g)" .
- A few really large masses dominate the histogram so create a log10 scaled version. Change the x axis label to "Female Mass(g)" and the color of the bars to blue (using the fill = "blue" argument).
- Now let’s add the data for male birds as well. Create a single graph with histograms of both female and male body mass. Due to the way the data is structured you’ll need to add a 2nd geom_histogram() layer that specifies a new aesthetic. To make it possible to see both sets of bars you’ll need to make them transparent with the optional argument alpha = 0.3 .
- These distributions seem about the same, but this is all birds together so it might be difficult to see any patterns. Use facet_wrap() to make one subplot for each family.
- Make the same graph as in the last task, but for wing size instead of mass. Do you notice anything strange? If so, you may have gotten caught by the use of non-standard null values. If you already noticed and fixed this, Nice Work! If not, you can use the optional na.strings = c("-999", "-999.0") argument in read.csv() to tell R what value(s) indicated nulls in a dataset.
Sexual Dimorphism Data Manipulation (30 pts)
This is a follow-up to Sexual Dimorphism Exploration .
Having done some basic visualization of the Lislevand et al. 2007 dataset of bird size measures you realize that you’ll need to do some data manipulation to really get at the questions you want to answer.
In Sexual Dimorphism Exploration you created a plot of the histograms of female and male masses by family. This resulted in a lot of plots, but many of them had low sample sizes.
Use dplyr to filter out null values for mass, group the data by family, and then filter the grouped data to return data only for families with more than 25 species. Add a comment to the top of the block of code describing what it does.
Now, remake your original graph using only the data on families with greater than 25 species.
Sexual size dimorphism doesn’t seem to show up clearly when visually comparing the distributions of male and female masses across species. Maybe the differences among species are too large relative to the differences between sexes to see what is happening, so you decide to calculate the difference between male and female masses for each species and look at the distribution of those values across all species in the data.
In the original data frame, use mutate() to create a new column which is the relative size difference between female and male masses
(F_mass - M_mass) / F_mass
and then make a single histogram that shows all of the species-level differences. Add a vertical line at 0 difference for reference.
Combine the two other tasks to produce histograms of the relative size difference for each family, only including families with more than 25 species.
Save the figure from task 3 as a jpg file with the name sexual_dimorphism_histogram.jpg .