Weather and Vehicle Collision Analysis using Immerse



MapD Immerse is a powerful tool, and for my first experiment with it, I wanted to assess the relationship between vehicle collisions and weather. The NYPD Motor Vehicle Collision Dataset provides collision data since 2012, with details about the location, borough, date, time, number of persons injured and killed, type of vehicle involved, and any contributing factors. Weather data was extracted from Weather Underground; it consists of borough, date, time, temperature, dew point, visibility, humidity, and weather conditions. I merged the two datasets on the borough, date, and time fields to create the Weather_VehicleCollision Dataset, and I’ve made it available in a bucket on S3.
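A merge on borough, date, and time could be sketched in plain Python roughly as follows. This is only an illustration, not the exact procedure used to build the dataset: the field names (`occurred_at`, `observed_at`, and so on) and the sample rows are hypothetical, and matching a collision to the weather reading from the same clock hour is an assumption about how the time fields line up.

```python
from datetime import datetime

def hour_key(borough, timestamp):
    """Join key: normalized borough name plus the timestamp truncated to the hour."""
    return (borough.strip().upper(),
            timestamp.replace(minute=0, second=0, microsecond=0))

def merge_collisions_with_weather(collisions, weather):
    """Left-join each collision to the weather reading for the same borough and hour.

    Collisions without a matching reading keep None in the weather fields.
    """
    weather_by_key = {hour_key(w["borough"], w["observed_at"]): w for w in weather}
    merged = []
    for c in collisions:
        w = weather_by_key.get(hour_key(c["borough"], c["occurred_at"]), {})
        merged.append({
            **c,
            "temperature": w.get("temperature"),
            "dew_point": w.get("dew_point"),
            "visibility": w.get("visibility"),
            "conditions": w.get("conditions"),
        })
    return merged

# Illustrative rows only (not taken from the real datasets).
collisions = [
    {"borough": "Queens", "occurred_at": datetime(2017, 1, 7, 8, 42), "persons_injured": 1},
    {"borough": "Brooklyn", "occurred_at": datetime(2017, 1, 7, 9, 5), "persons_injured": 0},
]
weather = [
    {"borough": "QUEENS", "observed_at": datetime(2017, 1, 7, 8, 0),
     "temperature": 27.0, "dew_point": 19.0, "visibility": 2.0, "conditions": "Light Snow"},
]
merged = merge_collisions_with_weather(collisions, weather)
```

A left join keeps every collision even when no weather reading matches, which is how null weather values end up in the merged table.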

Before we begin our analysis, it’s best to highlight some discrepancies in the data. First, Brooklyn has the highest number of null values for weather conditions. This will lead to some clusters having almost no records for Brooklyn, as null and unknown values can’t be considered for analysis. Second, the number of null values for contributing factors increases significantly in 2016; this may be the reason for a decline in the number of records with other values during the same period. Some other fields, e.g. Precipitation, also have a high number of null records. And even though Vehicle Type has a really low null count, the majority of the vehicles have been classified under Passenger Vehicle.
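A quick way to surface gaps like these before charting is to count missing values per group. The sketch below assumes the merged records are available as a list of dictionaries; the field names and the set of values treated as "null or unknown" are assumptions for illustration.

```python
from collections import Counter

# Values treated as missing; this set is an assumption, adjust it to the data.
NULL_MARKERS = {None, "", "Unknown", "Unspecified"}

def null_counts_by_group(rows, field, group_field):
    """Count rows whose `field` is missing, grouped by `group_field`."""
    counts = Counter()
    for row in rows:
        if row.get(field) in NULL_MARKERS:
            counts[row.get(group_field)] += 1
    return counts

# Illustrative rows only.
rows = [
    {"borough": "BROOKLYN", "conditions": None},
    {"borough": "BROOKLYN", "conditions": "Unknown"},
    {"borough": "QUEENS", "conditions": "Clear"},
    {"borough": "QUEENS", "conditions": ""},
]
missing_conditions = null_counts_by_group(rows, "conditions", "borough")
```

Running the same function with a year as the grouping field would show whether missing contributing factors spike in a particular period.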


Customer profiling is primarily done to identify customers’ buying patterns. We can apply the same technique to collisions to understand weather patterns, by creating a bubble chart with visibility, dew point, weather conditions, and number of records as parameters. We get three distinct clusters; squalls and thunderstorms are outliers, and collisions under those conditions are not analyzed.
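The numbers behind a bubble chart like this amount to one point per weather condition. The sketch below assumes each bubble is placed at the average visibility and average dew point of its condition and sized by record count; that reading of the chart, along with the field names and sample rows, is an assumption rather than a description of how Immerse computes it.

```python
from collections import defaultdict

def bubble_chart_points(rows):
    """Aggregate rows into one point per weather condition.

    Rows with a missing condition, visibility, or dew point are skipped.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # visibility sum, dew point sum, count
    for row in rows:
        values = (row.get("conditions"), row.get("visibility"), row.get("dew_point"))
        if None in values:
            continue
        acc = sums[values[0]]
        acc[0] += values[1]
        acc[1] += values[2]
        acc[2] += 1
    return {
        condition: {
            "avg_visibility": vis_sum / count,
            "avg_dew_point": dew_sum / count,
            "records": count,
        }
        for condition, (vis_sum, dew_sum, count) in sums.items()
    }

# Illustrative rows only.
rows = [
    {"conditions": "Light Snow", "visibility": 2.0, "dew_point": 19.0},
    {"conditions": "Light Snow", "visibility": 4.0, "dew_point": 21.0},
    {"conditions": "Clear", "visibility": 10.0, "dew_point": 40.0},
    {"conditions": None, "visibility": 10.0, "dew_point": 40.0},
]
points = bubble_chart_points(rows)
```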

Cluster 1: Low visibility (less than 8 miles) with low dew point (less than 30°F)
Cluster 1 includes eleven cold-weather conditions, such as light snow and light ice pellets. The peaks in the line chart are mostly collisions caused by slippery pavement, which is very common in winter due to snow. Since this is a cyclic event, the plot stays rather flat during summer, except for a sudden rise in summer 2017. This could be attributed to the unusual light summer snow / ice pellets in New York.

Cluster 2: High visibility (more than 8 miles) with medium dew point (more than 36°F and less than 46°F)
Cluster 2 comprises collisions that occurred during good weather with visibility greater than 8 miles; it’s the biggest cluster by number of collisions. Splitting the data on a temporal scale shows that the majority of collisions on weekdays occur during rush hour traffic, with the evening rush hour contributing more than the morning one. Weekends, on the other hand, are famous for nights out. Cross-filtering the cluster further by contributing factor and time of day gives an exact percentage of records for each contributing factor at a given time. Of the 78,105 collisions that happened between 4 pm and 6 pm, 6% were due to fatigue / drowsiness. The sharp decline in the number of fatigue / drowsiness related collisions is probably due to the increase in null values in mid-2016, but the pattern continues through 2018, which comes as good news. I wonder what the plot would have looked like during the recession, when fewer people were commuting?
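The cross-filter on time of day and contributing factor boils down to a filtered count and a division. Here is a minimal sketch, assuming hypothetical field names and taking every collision in the time window as the denominator (whether null factors should be excluded is a separate choice); the sample rows are illustrative and do not reproduce the figures above.

```python
from collections import Counter
from datetime import datetime

def factor_shares(rows, start_hour, end_hour):
    """Percentage of collisions per contributing factor within [start_hour, end_hour)."""
    in_window = [r for r in rows if start_hour <= r["occurred_at"].hour < end_hour]
    if not in_window:
        return {}
    counts = Counter(r["contributing_factor"] for r in in_window)
    return {factor: 100.0 * n / len(in_window) for factor, n in counts.items()}

# Illustrative rows only.
rows = [
    {"occurred_at": datetime(2017, 3, 6, 16, 30), "contributing_factor": "Fatigued/Drowsy"},
    {"occurred_at": datetime(2017, 3, 6, 17, 10), "contributing_factor": "Driver Inattention"},
    {"occurred_at": datetime(2017, 3, 6, 17, 45), "contributing_factor": "Driver Inattention"},
    {"occurred_at": datetime(2017, 3, 6, 9, 0), "contributing_factor": "Fatigued/Drowsy"},
]
shares = factor_shares(rows, 16, 18)  # 4 pm up to, but not including, 6 pm
```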

Cluster 3: Low visibility (less than 8 miles) with high dew point (more than 46°F)
Cluster 3 consists of bad weather conditions occurring in both summer and winter, but the events are of an entirely different nature from those in Cluster 1. Fog, haze, and mist in Cluster 3 appear more often with morning accidents, whereas rain and thunderstorms are the reason behind Tuesday’s extreme deviation from normal. Brooklyn has the fewest records, owing to its high number of null values.
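The three cluster definitions can be written down as a small rule-based function. This is a sketch of the thresholds as stated in the cluster headings, not of any clustering algorithm; the function name is hypothetical, and inputs that fall in the gaps between the stated ranges (for example a dew point between 30°F and 36°F) are returned as unassigned.

```python
def assign_cluster(avg_visibility, avg_dew_point):
    """Map a weather condition's average visibility (miles) and dew point (°F) to a cluster.

    Returns 1, 2, or 3, or None when the values match none of the stated ranges.
    """
    if avg_visibility < 8 and avg_dew_point < 30:
        return 1  # low visibility, low dew point
    if avg_visibility > 8 and 36 < avg_dew_point < 46:
        return 2  # high visibility, medium dew point
    if avg_visibility < 8 and avg_dew_point > 46:
        return 3  # low visibility, high dew point
    return None  # falls between the stated thresholds

# Illustrative inputs only.
examples = {
    "cold and snowy": assign_cluster(2.0, 19.0),
    "clear": assign_cluster(9.5, 40.0),
    "warm and foggy": assign_cluster(5.0, 55.0),
    "in between": assign_cluster(5.0, 33.0),
}
```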


The human element as a factor in collisions needs further research, as most of the accidents occurred on days with good weather. I won’t be surprised if the number of collisions drops drastically with the adoption of self-driving cars. What other datasets would you consider merging with the current analysis? Let’s discuss this in the comments section. Till then, keep calm and drive safely.