Integration of Public-Use Metadata and Flickr-hosted Image Data
Hackday here at Flite is an integral component of our company culture. As Flite employees we pride ourselves on being agile, hard-working, and results-oriented.
Keeping this in mind, Hackday 2013 was kicked off with an Iron Chef theme, culinary-based team names (go Trout!!), and a Hacker draft.
The “Chairman” unveiled the special ingredient for this Hackday challenge: Big Data. Additionally, data resources were distributed to all teams and a firm deadline of 4PM was established.
After a quick Trout brainstorming session, we settled on two goals for our Hackday: track interactions to gain insight into platform usage, and make use of both Flickr’s large photo repository and public-use metadata.
With the challenge rooted in data, the first steps were deciding where to source the data and how to mesh it into a useful format.
Interaction tracking data was sourced from within the platform. The random sample consisted of aggregated interactions across both weekdays and weekends (20,000 sessions per day) over a month-long interval. After statistical analysis and visual inspection, the expected results did not materialize. We had expected a large difference between weekday and weekend interactions, on the assumption that employees at work would fuel interactions during the week, while weekends would be spent on leisure activities, meaning more time away from the “point” of interaction.
The data tells a different story. The graphics show similar interaction patterns and usage habits across weekdays and weekends, with the weekend curve shifted horizontally by 2 hours to the right. This shift could reflect the later start to people’s days on the weekend, as well as their tendency to stay up later on weekend nights, which would explain the higher post-midnight interactions on weekends compared to weekdays.
Additionally, the confidence interval is much tighter for weekday interactions than for weekend interactions. This can be attributed to the predictability of workday hours, compared with the greater variance of less regimented weekend days.
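The two observations above (a rightward shift in the weekend curve and tighter weekday confidence intervals) can be checked with a few lines of Python. This is only a sketch under assumptions: the function names `mean_ci` and `best_shift` are ours, the interval uses a normal approximation with z = 1.96, and the shift is estimated by a simple least-squares alignment of two 24-value hourly curves. It is not necessarily the analysis the team ran.

```python
from math import sqrt
from statistics import mean, stdev


def mean_ci(values, z=1.96):
    """Return (mean, half_width) of a normal-approximation confidence interval.

    `values` would be, for example, the interaction counts observed at one
    hour of the day across all weekdays (or all weekend days) in the sample.
    """
    m = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return m, half_width


def best_shift(weekday, weekend, max_shift=6):
    """Estimate how many hours the weekend curve lags the weekday curve.

    Both inputs are hourly curves of equal length (e.g. 24 values).
    The result is the circular shift with the smallest squared error;
    a positive value means the weekend curve sits to the right.
    """
    n = len(weekday)

    def squared_error(shift):
        return sum(
            (weekend[(h + shift) % n] - weekday[h]) ** 2 for h in range(n)
        )

    return min(range(-max_shift, max_shift + 1), key=squared_error)
```

Calling `mean_ci` once per hour for each day type gives the band to plot around each curve; a wider half-width for weekend hours would match what we saw.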
UFO Data & Flickr Photos
Within the group there was a strong desire to make the most of the UFO data available from the National UFO Reporting Center. The UFO data was refined and filtered, leaving the desired cell-based variables:
- color associated with UFO sighting
- gender of claimant
- shape of UFO
- use of term “alien” in the sighting description
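As an illustration of that filtering step, the sketch below reduces one sighting record to the four variables listed above. The record layout (`description`, `shape`, and `gender` keys), the color list, and the function name `extract_features` are all hypothetical; the actual column names in our working dataset may have differed.

```python
import re

# Illustrative color vocabulary only; extend as needed.
KNOWN_COLORS = ("red", "orange", "green", "blue", "white")


def extract_features(record):
    """Reduce a raw sighting record (a dict) to the four variables of interest."""
    text = record.get("description", "").lower()
    words = set(re.findall(r"[a-z]+", text))

    # First known color mentioned in the description, if any.
    color = next((c for c in KNOWN_COLORS if c in words), None)

    return {
        "color": color,
        "gender": record.get("gender"),  # hypothetical field name
        "shape": (record.get("shape") or "").strip().lower() or None,
        "mentions_alien": bool(words & {"alien", "aliens"}),
    }
```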
Team focus shifted towards the use of this observational UFO data as a means of forming queries within the Flickr platform. The Flickr platform is rich with tags, locations, and personal descriptions that should match well with our workable dataset. Matching text data with image tag search results from Flickr should yield photos of these UFOs and in turn lead to visual verification of these sightings.
Initial obstacles included data formatting issues (text versus numerical). Flickr searches also returned different results depending on whether the query was manual or automated. Strangely, manual searches yielded more results than automated ones, which made our service less effective and undercut the practical use of what we were building. The limited automated results came down to two things: the difference in geographic tagging between Flickr (numerical) and the UFO data (text), and the difficulty of including all search terms jointly using an “or” syntax.
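Both obstacles can be sketched in code. The example below builds the parameters for a `flickr.photos.search` call, joining the search terms with `tag_mode` set to `any` (an “or” match) and converting a text location into the numerical `lat`/`lon` values Flickr expects. The lookup table `CITY_COORDS`, the function name `build_search_params`, and the default radius are assumptions for illustration; a real run would need a proper geocoder and a valid API key, and this is not necessarily how our Hackday code was written.

```python
from urllib.parse import urlencode

FLICKR_REST_URL = "https://api.flickr.com/services/rest/"

# Hypothetical lookup table standing in for a geocoder:
# text location from the UFO data -> (latitude, longitude).
CITY_COORDS = {
    "seattle, wa": (47.6062, -122.3321),
    "phoenix, az": (33.4484, -112.0740),
}


def build_search_params(terms, place, api_key="YOUR_API_KEY", radius_km=30):
    """Build query parameters for flickr.photos.search.

    `terms` are tag strings such as ["ufo", "triangle"]; `place` is the
    text location from a sighting report.
    """
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "tags": ",".join(terms),
        "tag_mode": "any",  # "any" = OR across tags, "all" = AND
        "format": "json",
        "nojsoncallback": 1,
    }
    coords = CITY_COORDS.get(place.strip().lower())
    if coords is not None:
        # Only add the geo filter when the text location could be resolved.
        params.update(lat=coords[0], lon=coords[1], radius=radius_km)
    return params


def build_search_url(terms, place, **kwargs):
    """Return the full request URL for the parameters above."""
    return FLICKR_REST_URL + "?" + urlencode(build_search_params(terms, place, **kwargs))
```

Leaving the geo filter off when the location cannot be resolved keeps the query broad instead of silently returning nothing.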
Less-than-stellar query results continued, so as part of the iterative process we loosened our search parameters and tags. Excluding the gender and color associated with each UFO sighting increased the number of results and made our search more effective. The trade-off is that with fewer specified variables, the strength of the test and the significance of the results both decrease.
After discarding several search variables and converting the geographic tags into numerical data, we were able to produce query results that exceeded the threshold we had set as our benchmark. With the desired results in hand, the JPEG photos were “in-filed” to JSON and uploaded to a user interface for the 4 PM demo.
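That loosening loop can be sketched as follows. The function name `loosen_until` and the `search` callable (anything that takes the active terms and returns a result count) are hypothetical; the drop order and threshold are whatever the team chooses.

```python
def loosen_until(search, terms, drop_order, threshold):
    """Drop search variables one at a time until enough results come back.

    search     -- callable taking a dict of active terms, returning a count
    terms      -- dict such as {"shape": "triangle", "color": "orange", ...}
    drop_order -- variable names to discard, least important first
    threshold  -- minimum acceptable number of results
    Returns (active_terms, dropped_names, final_count).
    """
    active = dict(terms)
    dropped = []
    count = search(active)
    for name in drop_order:
        if count >= threshold:
            break
        if name in active:
            del active[name]
            dropped.append(name)
            count = search(active)
    return active, dropped, count
```

Recording which variables were dropped makes the trade-off explicit: each entry in the returned list is a constraint the matched photos no longer satisfy.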