Building on the use case, we identify data sources and types, and describe the data analysis and visualization.
Step 1: Review the sequence of smaller stories, and the list of important information. Modify, expand, or narrow down your story. Accordingly, revise the list of variables (important information).
Step 2: Explain why you need these data, and describe data types, sources, and possible statistical analysis. Describe how you will use the data in modeling.
Step 3: Create a visual/image, e.g. in PPT, to demonstrate the expected visualization of results.
The first step is to transform the use case into a sequence of small stories, or in other words, the user's or consumer's decision-making steps. Here is an example based upon UMKC’s EduKC:
Step 1: What kind of camps will be available during the spring break?
Step 2: What do the kids like?
Step 3: How long would it take to commute between camps, work, and home?
Step 4: What kind of camp schedule would be best for my work schedule?
Step 5: Shall I send them to the same camp or different camps?
Step 6: Do I have to prepare lunch for them?
Step 7: How much can I afford for the camp?
Step 8: Will it be worth sending them to camps?
Q1. What kind of data do you need? Elaborate on this for each small story.
You can review the list of important information in the use case, and decide whether you want to make any modifications.
You can decide whether you are going to pursue the bigger story or focus on a smaller story, or a sub-story of the main story.
Q2. Why do you need the data?
Read the use case and then pretend to be the user. Justify users’ needs, prioritize them, and identify data needed to fulfill each need.
E.g., #1. “I” need to make sure these are local spring break camps, and that they are open to male students between 9 and 12. That is why “I” have these variables:
The starting date of the spring break
The ending date of the spring break
Gender requirements of the camp
Age requirements of the camp
Starting time of the camp
Ending time of the camp
Location of the camp
The distance of the camp location to my home
#2. “I” want to know whether my kids may be interested in the camp, so “I” include this variable: what the camp is about (STEM, art, or sports).
#3. “I” want to ensure my kids’ safety, health, and educational experience. That is why “I” have these variables:
What other parents say about the camps (educational or not)
Whether the facility is safe
Whether the facility is clean
The age of the building
#4. “I” want to make sure my kids feel socially comfortable in a camp with roughly the same number of boys and girls, or more boys. Their friends are likely to choose the same camp. That is why “I” have these variables:
The home school of the kids participating in the camp
The ethnicity of the kids participating in the camp
The historical gender-prevalence of co-ed camps
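The variables listed under needs #1 to #4 can be gathered into one record per camp. The following is a minimal Python sketch of such a record. The class name `CampRecord`, the field names, the field types, and the example values are all assumptions made for illustration; the use case does not prescribe them.

```python
from dataclasses import dataclass, field
from datetime import date, time
from typing import List, Optional


@dataclass
class CampRecord:
    """One camp, holding the variables from needs #1 to #4 (hypothetical schema)."""

    # Need #1: is the camp available to my kids?
    name: str
    start_date: date
    end_date: date
    gender_requirement: str          # e.g. "boys", "girls", "co-ed"
    min_age: int
    max_age: int
    start_time: time
    end_time: time
    location: str
    distance_from_home_km: Optional[float] = None

    # Need #2: will my kids be interested?
    topic: Optional[str] = None      # "STEM", "art", or "sports"

    # Need #3: safety, health, and educational experience
    parent_comments: List[str] = field(default_factory=list)
    facility_safe: Optional[bool] = None
    facility_clean: Optional[bool] = None
    building_age_years: Optional[int] = None

    # Need #4: social comfort
    participant_home_schools: List[str] = field(default_factory=list)
    participant_ethnicity: Optional[str] = None
    historical_gender_prevalence: Optional[str] = None


# Hypothetical example values, for illustration only.
example = CampRecord(
    name="Example Camp",
    start_date=date(2024, 3, 11),
    end_date=date(2024, 3, 15),
    gender_requirement="co-ed",
    min_age=9,
    max_age=12,
    start_time=time(9, 0),
    end_time=time(16, 0),
    location="Overland Park, Kansas",
)
```

Fields that have to be estimated later (from reviews or images) default to `None` or an empty list, so a record can be created as soon as the basic listing information is known.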
Q3. How do you use the data in your Machine Learning applications?
- E.g., users will input a date and an age, which are compared with each camp's dates and age requirement.
- Google Maps location data will be used to create the map, highlight the locations of different camps, and show the distance to my home.
- Text data of the camp title will be extracted and shown on the map.
- Text data of the camp title and description will be used to determine what the camp is about: STEM, arts, or sports.
- Text data of time (before 12 p.m. or after 12 p.m.) will be used to determine whether the camp is:
Morning: Meets only in the morning.
Afternoon: Meets only in the afternoon.
Morning & Afternoon: You can choose to do either morning, afternoon, or morning and afternoon.
Full-day: Morning and afternoon.
- Text data of the gender requirement will be used to determine whether the camp is for:
Boys & girls: boys and girls will be in different sessions.
Co-ed: all genders.
- Text data of online parent comments will be used to evaluate:
Whether the camp is educational.
Whether the camp building is clean.
Whether the camp neighborhood is safe.
- Image data of the camp’s building will be used to estimate the age of the building.
- Neighborhood safety data or images of the surrounding area will be used to determine the safety of the camp location.
- Photos of the inside of the building will be used to evaluate whether the camp facility is clean.
- School district data is used to estimate the home school of the kids.
- Image data of past camps will be used to determine the historical gender-prevalence of co-ed camps and race and ethnicity:
Co-ed (boys): Historically more boys.
Co-ed (girls): Historically more girls.
Race and ethnicity of the kids at the camp
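Several of the uses above are simple rules that can be sketched in a few lines. The Python sketch below covers four of them: comparing the user's date and age with a camp's requirements, labeling a session from its start and end time, guessing the topic from the title and description, and computing a distance to home. All function names, the keyword lists, and the noon cutoff are assumptions for illustration. The distance function uses the haversine (straight-line) formula as a stand-in; it is not the Google Maps distance and does not reflect commute time. The keyword matcher is a crude substitute for a trained text classifier.

```python
import re
from datetime import date, time
from math import asin, cos, radians, sin, sqrt

NOON = time(12, 0)  # assumed cutoff between morning and afternoon

# Hypothetical keyword prefixes per topic; a real list would need tuning.
TOPIC_KEYWORDS = {
    "STEM": ("robot", "coding", "science", "math", "engineering"),
    "arts": ("art", "paint", "music", "theater", "dance"),
    "sports": ("soccer", "basketball", "swim", "tennis", "sport"),
}


def is_eligible(child_age, day, min_age, max_age, camp_start, camp_end):
    """True if the child's age and the requested day fit the camp."""
    return min_age <= child_age <= max_age and camp_start <= day <= camp_end


def session_type(start, end):
    """Label one session from its start and end time.

    "Morning & Afternoon" (a choice between sessions) cannot be derived
    from a single time range, so it is not returned here.
    """
    if end <= NOON:
        return "Morning"
    if start >= NOON:
        return "Afternoon"
    return "Full-day"


def classify_topic(text):
    """Pick the topic whose keyword prefixes match the most words."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = {
        topic: sum(w.startswith(k) for w in words for k in keys)
        for topic, keys in TOPIC_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


def haversine_km(lat1, lon1, lat2, lon2):
    """Straight-line distance in km between two latitude/longitude points."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))
```

For example, `session_type(time(9, 0), time(16, 0))` yields a full-day label, and `classify_topic("Robotics and coding camp")` matches only STEM keywords. The image-based judgments (building age, cleanliness, gender prevalence) are not sketched here because they would require a trained image model.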
Q4. How do you get the data?
“I” will search for local spring break camps online, using keywords like “spring break camps in Overland Park, Kansas.” “I” will also conduct a Google location search and an image search for camp photos from previous years. Google reviews of the organizations that offer spring break camps will also be collected.
Q5. Do you know of any specific data sets you could use to analyze your story?
School district websites, Google location information, Google reviews, Google image search, websites of spring break camps.
Q6. Data type for each of your data sources, for example: number, text, images, video, audio, social network, database
The data will be mostly text, plus location data from Google Maps and image data of the camps. The images are used to determine whether a camp has historically had more boys or girls, the race and ethnicity of participants, the age of the building, its cleanliness, and the surrounding neighborhood.
Q7. What kind of analysis would you do? (Correlation, frequency, cross-tabulation, average, classification, regression, ANOVA, etc.)
In this case, simple descriptive statistics are sufficient.
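As a rough illustration of what simple descriptive statistics could look like here, the Python sketch below computes a frequency count, a cross-tabulation, and an average over a handful of camp records. The function name `summarize`, the dictionary keys, and the four sample records are hypothetical and made up for illustration; they are not real camp data.

```python
from collections import Counter
from statistics import mean, median


def summarize(camps):
    """Frequency, cross-tabulation, and averages for a list of camp dicts."""
    distances = [c["distance_km"] for c in camps]
    return {
        # Frequency: how many camps per topic
        "topic_counts": Counter(c["topic"] for c in camps),
        # Cross-tabulation: how many camps per (topic, session) pair
        "topic_by_session": Counter((c["topic"], c["session"]) for c in camps),
        # Average and median distance from home
        "mean_distance_km": mean(distances),
        "median_distance_km": median(distances),
    }


# Hypothetical sample records, for illustration only.
sample_camps = [
    {"topic": "STEM", "session": "Full-day", "distance_km": 4.0},
    {"topic": "STEM", "session": "Morning", "distance_km": 6.0},
    {"topic": "arts", "session": "Morning", "distance_km": 2.0},
    {"topic": "sports", "session": "Afternoon", "distance_km": 8.0},
]

summary = summarize(sample_camps)
```

With these made-up records, `summary["topic_counts"]["STEM"]` is 2 and the mean distance is 5.0 km.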
Q8. What kind of measurements would you use? (Data quality, privacy, etc.)
I would use data quality measurement to make sure the information I gather is accurate. I do not want to send my kids to a camp that does not exist.
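One possible way to make a data quality measurement concrete is to run a few checks on every camp record before using it. The Python sketch below flags missing fields, inconsistent ranges, and records confirmed by too few sources. The function name `quality_issues`, the field names, and the two-source rule are assumptions for illustration, not an established standard.

```python
from datetime import date

REQUIRED_FIELDS = ("name", "location", "start_date", "end_date", "min_age", "max_age")
MIN_SOURCES = 2  # assumed rule: a camp should appear in at least two sources


def quality_issues(record):
    """Return a list of human-readable problems found in one camp record."""
    issues = []

    # Completeness: every required field must be present and non-empty.
    for name in REQUIRED_FIELDS:
        if record.get(name) in (None, ""):
            issues.append(f"missing {name}")

    # Consistency: ranges must run in the right direction.
    min_age, max_age = record.get("min_age"), record.get("max_age")
    if min_age is not None and max_age is not None and min_age > max_age:
        issues.append("min_age is greater than max_age")
    start, end = record.get("start_date"), record.get("end_date")
    if start is not None and end is not None and start > end:
        issues.append("start_date is after end_date")

    # Existence: a camp seen in only one place may be outdated or not real.
    if len(record.get("sources", [])) < MIN_SOURCES:
        issues.append("confirmed by fewer than two sources")

    return issues


# Hypothetical record, for illustration only.
good_record = {
    "name": "Example Camp",
    "location": "Overland Park, Kansas",
    "start_date": date(2024, 3, 11),
    "end_date": date(2024, 3, 15),
    "min_age": 9,
    "max_age": 12,
    "sources": ["camp website", "Google reviews"],
}
```

A record with an empty issue list passes all three checks; anything else would be reviewed by hand before it is shown to the user.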
Q9. How much data do you think you need?
Now, you are ready to imagine ways of visualizing the data. For example: