Video Notes

Daily Video 1

Learning Objectives

  • Identify the challenges associated with processing data.

Essential Knowledge

  • The ability to process data depends on the capabilities of the users and their tools.
  • Data sets pose challenges regardless of size, such as:
    • the need to clean data
    • incomplete data
    • invalid data
    • the need to combine data sources
  • Depending on how data is collected, it may not be uniformed
  • Cleaning data is a process that makes the data uniform without changing their meaning.
  • Problems of bias are often created by the type or source of data being collected. Bias is not eliminated by simply collecting more data.
  • The size of data affects the amount of information that can be extracted from it.
  • Large data sets are hard to process using a single computer and may require parallel systems.
  • Scalability of systems is an important consideration when working with data sets, as the computational capacity of a system affects how data sets can be processed and stored.

Where do we start with data?

Collecting Data:

Issues to consider:

  • Source
    • What is the source?
    • Do you need more sources?
  • What tools do you have to analyze?

Processing Data is affected by Size:

  • How much information can we get?
  • Can one computer handle our task(s)?
    • Maybe we have to use parallel processing
      • Use two or more processors to handle different parts of the task
      • Check the scalability

Is there potential Bias?

  • Intentional:
    • Who collected the data?
    • Do they have an agenda?
  • Unintentional:
    • How is the data collected?
    • Who collected the data?

Data Cleaning:

  • Identify incomplete, corrupt, duplicate, or inaccurate records
  • Replacing, modifying, or deleting the “dirty” data
  • Be careful about modifying or deleting!
    • Be sure there is a mistake
    • Keep records of what is modified/deleted and WHY
    • Invalid data may need to be modified – keep from consistent

Daily Video 2

Learning Objectives

  • Describe what information can be extracted from data
  • Describe what information can be extracted from metadata

Essential Knowledge

  • Information is the collection of facts and patterns extracted from data.
  • Data provides oppurtunities for identifying trends, making connections, + addressing problems
  • Digitally processed data may show correlation between variables. A correlation found in data does not necessarily indicate that a casual relationship exists. Additional research is needed to understand the exact nature of the relationship.
  • Often a single source does not contain the data needed to draw a onclusion
  • Metadata is data about data
  • Changes and deletions to metadata do not change the primary data
  • Metadata is used for finding, organizing, + managing information
  • Metadata can increase the use of effective use of data or data sets by providing additional information
  • Metadata allow data to be structured + organized

What is metadata?

  • Prefix meta
    • behind, among, between
    • metadata = behind the scenes data
  • Some data has information about itself
    • Author
    • Date
    • Length/Size
  • Why care about this?
    • helps identify information
    • track/organize
    • process

Examples of metadata

  • Data - photo/image
    • Date
    • Time
    • Location
    • Height
    • Width
    • Pixels
    • Types of compression
  • Data - Book/Document
    • Author
    • Title
    • Genre
    • Date of Publication
    • Length
  • How might this be used?
    • Suggest Other books
    • Organize inventory

Finding Patterns

Data allows us to gain knowledge:

  • Look for trends
  • Look for patterns
  • Answer questions

  • Be careful!
  • Look out for misleading trends/patterns
  • Correlation does not imply causation

Quiz!

Question 1: A researcher is analyzing data about students in a school district to determine whether there is a relationship between grade point average and number of absences. The researcher plans on compiling data from several sources to create a record for each student.

The researcher has access to a database with the following information about each student.

  • Last name
  • First name
  • Grade level (9, 10, 11, or 12)
  • Grade point average (on a 0.0 to 4.0 scale)
  • The researcher also has access to another database with the following information about each student.

  • First name
  • Last name
  • Number of absences from school
  • Number of late arrivals to school

Upon compiling the data, the researcher identifies a problem due to the fact that neither data source uses a unique ID number for each student. Which of the following best describes the problem caused by the lack of unique ID numbers?

  1. Students who have the same name may be confused with each other.

  2. Students who have the same grade point average may be confused with each other.

  3. Students who have the same grade level may be confused with each other.

  4. Students who have the same number of absences may be confused with each other.

My Answer:

1, because it is possible that students have the same first name and last name.

Correct Answer:

Question 2: A team of researchers wants to create a program to analyze the amount of pollution reported in roughly 3,000 counties across the United States. The program is intended to combine county data sets and then process the data. Which of the following is most likely to be a challenge in creating the program?

  1. A computer program cannot combine data from different files.

  2. Different counties may organize data in different ways.

  3. The number of counties is too large for the program to process.

  4. The total number of rows of data is too large for the program to process.

My Answer:

2, because within data sets there are problems with cleaning data and making it accurate.

Correct Answer:


Question 3: A student is creating a Web site that is intended to display information about a city based on a city name that a user enters in a text field. Which of the following are likely to be challenges associated with processing city names that users might provide as input?

Select two answers.

  1. Users might attempt to use the Web site to search for multiple cities.

  2. Users might enter abbreviations for the names of cities.

  3. Users might misspell the name of the city.

  4. Users might be slow at typing a city name in the text field.

My Answer:

2 and 3 because they seem the most reasonable when it comes to challenges, Cities like LA is an abbrievation for Los Angeles. And misspelling is a common human error.

Correct Answer:

Already Correct!

Question 4: A database of information about shows at a concert venue contains the following information.

  • Name of artist performing at the show
  • Date of show
  • Total dollar amount of all tickets sold

Which of the following additional pieces of information would be most useful in determining the artist with the greatest attendance during a particular month?

  1. Average ticket price

  2. Length of the show in minutes

  3. Start time of the show

  4. Total dollar amount of food and drinks sold during the show

My Answer:

It could be 1 or 4 because normally more expensive tickets mean a better/popularer artist, but that is a Bias so I’m going to choose 4.

Correct Answer:

1, The total dollar amount of food and drinks sold during a show may be correlated with the attendance at the show, but cannot be used to determine the exact number of attendees.

Question 5: A camera mounted on the dashboard of a car captures an image of the view from the driver’s seat every second. Each image is stored as data. Along with each image, the camera also captures and stores the car’s speed, the date and time, and the car’s GPS location as metadata. Which of the following can best be determined using only the data and none of the metadata?

  1. The average number of hours per day that the car is in use

  2. The car’s average speed on a particular day

  3. The distance the car traveled on a particular day

  4. The number of bicycles the car passed on a particular day

My Answer:

2, I used the process of elimination, it could be one but im thinking it’s 2 more because the speed is one of the more important pieces of data. Distance can’t be captured because we are only using data and not metadata. And the number of biciycles seems kind of random.

Correct Answer:

4, The calculation of average speed would be based on speed and time, which are part of the metadata.

Question 6: A teacher sends students an anonymous survey in order to learn more about the students’ work habits. The survey contains the following questions.

  • On average, how long does homework take you each night (in minutes) ?
  • On average, how long do you study for each test (in minutes) ?
  • Do you enjoy the subject material of this class (yes or no) ?

Which of the following questions about the students who responded to the survey can the teacher answer by analyzing the survey results?

I. Do students who enjoy the subject material tend to spend more time on homework each night than the other students do? II. Do students who spend more time on homework each night tend to spend less time studying for tests than the other students do? III. Do students who spend more time studying for tests tend to earn higher grades in the class than the other students do?

  1. I only
  2. III only
  3. I and II
  4. I and III

My Answer:

3, because both correlate to the questions the teacher asked.

Correct Answer:

Already Correct!!