Final Project

Description and guidelines

  • This project’s learning objectives are to have students practice and synthesize the data science concepts learned throughout the semester in CSCI 104.

  • Students will address quantitative questions of their choice using real datasets.

  • This is a group project with 2-3 members. All work is to be completed by your group members alone, and all members should contribute equally. If there are any discrepancies, please notify the instructors.

  • Groups will turn in a single Jupyter notebook as their final deliverable. This notebook will be a mix of (1) cells with Python code and (2) English prose in Markdown cells explaining the code, data, visualizations and analyses. All data files should be uploaded to the project folder, using the steps from Lab 5.

  • Please produce a notebook that is readily understood by others. Write descriptive prose to describe what you are doing and what you’ve learned from your work. Organize your code well, comment it, use good variable names, etc. Also, present your data well: use descriptive labels for columns in tables, axes on plots, etc.

  • We provide the template notebook here.

  • You may consult the text, your notes, your lab work, our lecture examples, and the web pages associated with the course web page. You may also consult other references, but please clearly cite any sources you use and attribute any code or ideas to where you found them. If you use a source on the web, please provide the full URL and any author information you can identify.

Deliverables and Dates

Checkpoint: draft of Sections 0-3 (Due to Gradescope: 11/18 at 4pm)

  • Using the provided template notebook, complete a first draft of sections 0-3 in the rubric below.

  • Instructors will provide feedback on these sections to make sure the questions posed and the datasets chosen are on the right track for success.

  • This draft helps gauge how much work remains after Thanksgiving break, so these sections can be revised after feedback.

  • Think about which techniques will help you answer your quantitative questions. Also, think about whether your quantitative questions support the requirement that you perform statistical inference as part of your work: hypothesis testing, estimation, or prediction.
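
To illustrate the hypothesis-testing option, here is a minimal sketch of a permutation test for a difference in group means, written in plain Python. The function name `permutation_test` and the data values are made up for illustration; they are not part of the course library or the template notebook. In your own notebook you would draw the two groups from your dataset, and you may prefer the Table methods used in class.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, repetitions=5000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Hypothetical helper for illustration only; not a course library function.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(repetitions):
        # Under the null hypothesis the group labels are interchangeable,
        # so shuffle the pooled values and split them into two new groups.
        rng.shuffle(pooled)
        simulated = abs(mean(pooled[:size_a]) - mean(pooled[size_a:]))
        if simulated >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / repetitions


# Made-up measurements for two groups (assumption, not real data).
treatment = [12.1, 11.4, 13.0, 12.7, 11.9, 12.5]
control = [10.2, 10.9, 11.1, 10.4, 10.8, 11.0]

p_value = permutation_test(treatment, control)
print(f"Estimated p-value: {p_value:.4f}")
```

A small p-value suggests the observed difference would be unusual if the group labels did not matter. Your write-up should state the null and alternative hypotheses and interpret the result in full sentences.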

Instructor meeting 1 (Takes place: 12/07–12/09)

  • You will sign up for short meetings between your group members and the instructors. Sign-ups will be emailed out a few days before.

  • You will give an update about the project. Please use this as a time to chat with the instructors about challenges you are facing.

Instructor meeting 2 (Takes place: 12/13–12/15)

  • You will sign up for short meetings between your group members and the instructors. Sign-ups will be emailed out a few days before.

  • You will show us a (nearly) final version of the project.

Project notebook due (Due to Gradescope: 12/16 at 4pm)

  • Your final deliverable will be a Jupyter Notebook. Only submit one notebook per group. This should be a mix of (1) cells with Python code and (2) English prose in Markdown cells explaining the code, data, visualizations and analyses.

Final Project Rubric (100 points)

Component: 0. Description of Data
Requirements: Please tell us where you found the dataset(s) and what they include in general terms.
Points: 5

Component: 1. Quantitative questions
Requirements: Poses at least two quantitative questions about the dataset(s).
Points: 5

Component: 2. Loading data
Requirements: Loads at least one dataset. You may load more than one dataset if they support your overall quantitative questions. Clean the data, as in Lab 5. We provide the same library functions.
Points: 5

Component: 3. Descriptive statistics
Requirements: Uses code and text to provide at least three descriptions of the dataset (e.g. number of rows, mean of one of the columns).
Points: 10

Component: 4. Data wrangling
Requirements: Uses at least two Table methods (e.g. sort, where, take, apply, pivot, join, group) to do something meaningful with the data. Describes (in full English sentences) what those Table methods are doing.
Points: 10

Component: 5. Visualizations
Requirements: Creates at least two visualizations of the dataset (e.g. a scatter and line plot, or two histograms). Describes (using full English sentences) any interesting findings from the visualizations.
Points: 10

Component: 6. Statistical Inference
Requirements: Categories: (A) Hypothesis test, (B) Estimation (e.g. confidence intervals via bootstrapping), (C) Prediction (e.g. correlation or a linear regression line fit to a scatter plot). Correctly completes at least two statistical inference procedures (e.g. a hypothesis test and a bootstrap confidence interval; or two predictions). Discusses (in full English sentences) the implications of the statistical inference procedures.
Points: 30

Component: 7. Ethics
Requirements: Discusses (in full English sentences) at least one possible ethical consideration of using the dataset or doing analysis of the dataset (e.g. the potential harms of using the data from a Consequentialist or Deontologist perspective).
Points: 5

Component: 8. Conclusions
Requirements: Describes (in full English sentences) what has been learned and addresses the original quantitative questions.
Points: 10

Component: 9. Mastery and creativity
Requirements: A truly masterful data science project will go above and beyond these minimum requirements and creatively incorporate the concepts we have learned and practiced in this class. Feel free to go beyond the scope of what we have learned in this class if you have completed all other requirements.
Points: 10

Total: 100
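
As an illustration of the estimation category in the Statistical Inference component, the following is a minimal sketch of a bootstrap percentile confidence interval for a mean, using only the Python standard library. The helper name `bootstrap_mean_interval` and the sample values are assumptions made up for this example; they do not come from the course library, and your own analysis may use the Table-based resampling shown in class instead.

```python
import random
from statistics import mean


def bootstrap_mean_interval(sample, repetitions=5000, confidence=0.95, seed=0):
    """Return an approximate percentile bootstrap interval for the mean.

    Hypothetical helper for illustration only; not a course library function.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    size = len(sample)

    # Resample with replacement, same size as the original sample,
    # and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(sample, k=size)) for _ in range(repetitions)
    )

    # Keep the middle `confidence` fraction of the resampled means.
    tail = (1 - confidence) / 2
    lower_index = round(tail * repetitions)
    upper_index = round((1 - tail) * repetitions) - 1
    return resampled_means[lower_index], resampled_means[upper_index]


# Made-up sample values (assumption, not real data).
sample_values = [23, 19, 31, 25, 22, 28, 24, 30, 21, 27, 26, 20]

lower, upper = bootstrap_mean_interval(sample_values)
print(f"Sample mean: {mean(sample_values):.2f}")
print(f"Approximate 95% interval: ({lower:.2f}, {upper:.2f})")
```

When you discuss an interval like this, explain what population parameter it estimates and what the confidence level does and does not tell you.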