Curve-fitting Project – Linear Model (due at the end of Week 5)
For this assignment, collect data exhibiting a relatively linear trend, find the line of best fit, plot the data and the line, interpret the slope, and use the linear equation to make a prediction. Also, find r² (coefficient of determination) and r (correlation coefficient). Discuss your findings. Your topic may be one that is related to sports, your work, a hobby, or something you find interesting. If you choose, you may use the suggestions described below.
A Linear Model Example and Technology Tips are provided in separate documents.
Tasks for Linear Regression Model (LR)
(LR-1) Describe your topic, provide your data, and cite your source. Collect at least 8 data points. Label appropriately. (Highly recommended: Post this information in the Linear Model Project discussion as well as in your completed project. Include a brief informative description in the title of your posting. Each student must use different data.)
The idea with the discussion posting is two-fold: (1) To share your interesting project idea with your classmates, and (2) To give me a chance to give you a brief thumbs-up or thumbs-down about your proposed topic and data. Sometimes students get off on the wrong foot or misunderstand the intent of the project, and your posting provides an opportunity for some feedback. Remark: Students may choose similar topics, but must have different data sets. For example, several students may be interested in a particular Olympic sport, and that is fine, but they must collect different data, perhaps from different events or different gender.
(LR-2) Plot the points (x, y) to obtain a scatterplot. Use an appropriate scale on the horizontal and vertical axes and be sure to label carefully. Visually judge whether the data points exhibit a relatively linear trend. (If so, proceed. If not, try a different topic or data set.)
(LR-3) Find the line of best fit (regression line) and graph it on the scatterplot. State the equation of the line.
(LR-4) State the slope of the line of best fit. Carefully interpret the meaning of the slope in a sentence or two.
(LR-5) Find and state the value of r², the coefficient of determination, and r, the correlation coefficient. Discuss your findings in a few sentences. Is r positive or negative? Why? Is a line a good curve to fit to this data? Why or why not? Is the linear relationship very strong, moderately strong, weak, or nonexistent?
(LR-6) Choose a value of interest and use the line of best fit to make an estimate or prediction. Show calculation work.
(LR-7) Write a brief narrative of a paragraph or two. Summarize your findings and be sure to mention any aspect of the linear model project (topic, data, scatterplot, line, r, or estimate, etc.) that you found particularly important or interesting.
You may submit all of your project in one document or a combination of documents, which may consist of word processing documents or spreadsheets or scanned handwritten work, provided it is clearly labeled where each task can be found. Be sure to include your name. Projects are graded on the basis of completeness, correctness, ease in locating all of the checklist items, and strength of the narrative portions.
Here are some possible topics:
Choose an Olympic sport, specifically an event that interests you. Go to http://www.databaseolympics.com/ and collect data for winners in the event for at least 8 Olympic games (dating back to at least 1980). (Example: Winning times in Men’s 400 m dash). Make a quick plot for yourself to “eyeball” whether the data points exhibit a relatively linear trend. (If so, proceed. If not, try a different event.) After you find the line of best fit, use your line to make a prediction for the next Olympics (2014 for a winter event, 2016 for a summer event).
Choose a particular type of food. (Examples: Fish sandwich at fast-food chains, cheese pizza, breakfast cereal) For at least 8 brands, look up the fat content and the associated calorie total per serving. Make a quick plot for yourself to “eyeball” whether the data exhibit a relatively linear trend. (If so, proceed. If not, try a different type of food.) After you find the line of best fit, use your line to make a prediction corresponding to a fat amount not occurring in your data set. Alternative: Look up carbohydrate content and associated calorie total per serving.
Choose a sport that particularly interests you and find two variables that may exhibit a linear relationship. For instance, for each team for a particular season in baseball, find the total runs scored and the number of wins. Excellent websites: http://www.databasesports.com/ and http://www.baseball-reference.com/
Curve-fitting Project – Linear Model
This project uses data from MyFoodDiary, a site that provides nutritional information (retrieved from: https://www.myfooddiary.com/foods/search?q=cheese+pizza+12%22). The data set records nutritional facts for cheese pizza from several brands. Although a slice of pizza may not be the same size for every brand sampled, each data point records “Fat Content” in grams, which is the total fat in a single serving; this is mostly saturated fat, the component of greatest nutritional concern. Fat content can also be expressed as a percentage of the slice relative to its other nutritional contents, but this percentage measure is not used in the linear model. The calorie count of each serving is also recorded. A calorie is a unit used to measure energy. Consuming excess calories is likely to lead to weight gain and to more serious lifestyle problems such as diabetes, so food consumers have good reason to watch the calorie measure (Foster, et al., 2010).
In this case, cheese pizza is the fast food chosen for assessment. Pizza is widely loved, and its high fat content is hypothesized to be associated with high calorie counts. This rests on the scientific fact that fat provides more calories per gram than carbohydrates or proteins, as also noted by NHS Choices (2013). The high fat content of cheese pizza is therefore likely to lead to a high calorie count. A linear model is formulated to help predict the calorie count of a pizza serving, so that pizza lovers can better watch their lifestyle. This ties in with the 2,000 calories a day figure used for general nutrition advice. At the same time, as lifestyle diseases and their complications rise rapidly, consumers have become more responsible and concerned, and they find it increasingly difficult to trust the information that food manufacturers provide about their products (Nestle, 2013). This model is therefore a helpful tool for supporting consumer responsibility.
Fig 1: Scatter plot of Calories and Fat Content
The scatter plot of the data points indicates an increasing linear trend. This is further confirmed by the line of best fit, as seen below.
Fig 2: Scatter plot of Calories and Fat Content with line of best fit.
This line of best fit indicates that as the fat content of cheese pizza increases, the calorie count increases as well. The line of best fit shows where the data points would lie if there were no random error. In this sense, an increase in fat content in the cheese pizzas is associated with an increase in the level of calories. In other words, brands with high fat content will most likely have high calories. The fit indicates that about 83.1% of the variation in Calories, the dependent variable, is explained by the model with Fat Content as the independent variable. The equation of the regression line is y = 14.699x + 98.515.
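The report does not say which tool produced the regression line, and the underlying data table is not reproduced here. As a rough illustration only, the standard least-squares formulas for a line y = mx + b can be sketched in Python as follows. The function name fit_line and the demo points are assumptions made for this sketch; the demo points are made up to lie exactly on y = 2x + 1 and are not the pizza data.

```python
def fit_line(xs, ys):
    """Return (slope, intercept, r_squared) of the least-squares line y = m*x + b."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each take more than one distinct value")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy ** 2) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical demo data (NOT the pizza data): points exactly on y = 2x + 1.
demo_x = [0, 1, 2, 3]
demo_y = [1, 3, 5, 7]
slope, intercept, r_squared = fit_line(demo_x, demo_y)
print(slope, intercept, r_squared)
```

Because the demo points fall exactly on a line, the sketch recovers a slope of 2, an intercept of 1, and an r² of 1. With real data such as the fat and calorie values, r² would fall below 1.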
Determining the Slope of the Line
The points on the line chosen for determining its slope are (5, 172.01) and (20, 392.495).
m = (y2 – y1)/(x2 – x1)
= (392.495 – 172.01)/(20 – 5)
= 220.485/15 = 14.699
The slope is the measure of the steepness of the line of best fit, or the regression line. It expresses the change in y for a unit change in x along the line, that is, the rate of change in the predicted variable as the independent variable changes. In this case, the slope (14.699) means that each additional gram of Fat Content in a serving of cheese pizza is associated with a predicted increase of about 14.7 Calories.
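The two-point slope calculation can be checked with a few lines of Python. This is only a sketch: the helper name slope_between is an assumption made here, and the two points are the ones quoted in the report.

```python
def slope_between(p1, p2):
    """Slope of the straight line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    if x2 == x1:
        raise ValueError("slope is undefined for a vertical line")
    return (y2 - y1) / (x2 - x1)


# The two points on the regression line quoted in the report.
m = slope_between((5, 172.01), (20, 392.495))
print(round(m, 3))  # 14.699
```

Any two distinct points on the same line give the same slope, so the choice of x = 5 and x = 20 does not affect the result.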
Coefficients of Determination and Correlation
R² = 0.831
r = 0.9116
The high coefficient of determination (R²), which is close to 1, indicates that the line is a good fit for the data.
The data exhibit a very strong positive correlation between fat content and calories (r = 0.912).
There is a very strong linear relationship between Fat Content and Calories. This indicates that as Fat Content increases across the various cheese pizza brands, the Calorie count also tends to increase, and vice versa.
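For a simple linear regression with one independent variable, r is the square root of R² and takes the sign of the slope. The following Python sketch checks the two reported values against each other; the function name correlation_from_r_squared is an assumption made for illustration.

```python
import math


def correlation_from_r_squared(r_squared, slope):
    """Recover r from R^2 for a one-variable linear regression.

    The magnitude is sqrt(R^2); the sign is taken from the slope.
    """
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R^2 must lie between 0 and 1")
    return math.copysign(math.sqrt(r_squared), slope)


# Values reported above: R^2 = 0.831, slope = 14.699 (positive).
r = correlation_from_r_squared(0.831, 14.699)
print(round(r, 4))  # 0.9116
```

The positive slope is what makes r positive here; a downward-sloping line with the same R² would give r = -0.9116.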
The equation of the linear model formulated is y = 14.699x + 98.515, where y denotes the Calories and x denotes the Fat Content.
In this case, to predict the calories in a slice of cheese pizza from a brand with a fat content of 29g per serving, the calculation is as follows:
y = 14.699x + 98.515
= 14.699×29 + 98.515
= 426.271 + 98.515
= 524.786
Therefore, the linear model predicts that a slice of cheese pizza with 29g fat content will provide about 524.8 calories.
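The same prediction can be reproduced in Python. This is a minimal sketch: the constant and function names are assumptions made here, while the slope and intercept are the coefficients of the regression equation stated in the report.

```python
# Coefficients of the regression line y = 14.699x + 98.515 from the report.
SLOPE = 14.699       # predicted Calories per gram of fat
INTERCEPT = 98.515   # predicted Calories at 0 g of fat


def predict_calories(fat_grams):
    """Predicted Calories per serving for a given Fat Content in grams."""
    return SLOPE * fat_grams + INTERCEPT


estimate = predict_calories(29)
print(round(estimate, 1))  # 524.8
```

The function can be called with any fat amount, although predictions are most trustworthy for values near the range of fat contents actually present in the data set.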
The linear regression model that fits the data is particularly interesting in that it explains a large part of the variation in the predicted variable, and the line of best fit is a good fit for the data. The relationship in the data is not only meaningful but also strong. The model is therefore generally good and suitable for prediction.
The scatter plot with the line of best fit is particularly interesting because many different lines could be drawn through the data, but the line of best fit, which is the linear model in this case, accounts for 83.1% of the variation in the dependent variable, Calories. The line is judged to be a good fit because the correlation coefficient (r = 0.9116) is close to 1.
Moreover, the data confirm that a regression model is useful only when the dependent and independent variables are correlated. The very strong correlation between the Fat Content and Calories variables validates the model and its strength in prediction. The formulated model can then be used to predict Calorie values from Fat Content by the regression technique, while allowing for random errors that the line of best fit does not capture.
The strong positive correlation found in this project is highly consistent with expectations based on nutrition science, a point that also underpins the work of Popkin, Adair, and Ng (2012). In a society facing a rising menace of obesity and lifestyle diseases in general, the need to examine and assess the contents of the foods we eat grows, and this linear model is timely in that shift. The model is therefore clearly meaningful.
This linear model can be used by both food manufacturers and consumers. Manufacturers can use it to determine appropriate levels of fat to include in cheese pizzas, for the marketability of their brands and for socially responsible production, while consumers can use it to choose the best brands on the basis of calorie content.
NHS Choices. (2013). Healthy eating. Online information available at http://www.nhs.uk/livewell/healthy-eating/Pages/Healthyeating.aspx (accessed June 2015).
Foster, G. D., Wyatt, H. R., Hill, J. O., Makris, A. P., Rosenbaum, D. L., Brill, C., … & Zemel, B. (2010). Weight and metabolic outcomes after 2 years on a low-carbohydrate versus low-fat diet: a randomized trial. Annals of Internal Medicine, 153(3), 147-157.
Nestle, M. (2013). Food politics: How the food industry influences nutrition and health (Vol. 3). Univ of California Press.
Popkin, B. M., Adair, L. S., & Ng, S. W. (2012). Global nutrition transition and the pandemic of obesity in developing countries. Nutrition Reviews, 70(1), 3-21.