{"id":907,"date":"2023-03-15T20:59:49","date_gmt":"2023-03-16T00:59:49","guid":{"rendered":"https:\/\/www.aibistin.com\/?p=907"},"modified":"2024-03-27T18:21:28","modified_gmt":"2024-03-27T22:21:28","slug":"analysis-of-nyc-traffic-collisions-using-linux-cli-and-perl","status":"publish","type":"post","link":"https:\/\/www.aibistin.com\/?p=907","title":{"rendered":"Analysis of NYC Traffic Collisions using Linux CLI and Perl"},"content":{"rendered":"\n\n\n<div class=\"wp-block-jetpack-markdown\"><h5>Just when you think it\u2019s safe outside. Here are some statistics relating to traffic collisions in New York City.<\/h5>\n<h3>Background<\/h3>\n<p>NYC publishes vehicle collision data which anyone can access using their API. You can also download this information in standard <a href=\"https:\/\/data.cityofnewyork.us\/Public-Safety\/Motor-Vehicle-Collisions-Crashes\/h9gi-nx95\">CSV (Comma Separated Values) file format<\/a>.\nThe file is fairly large, 420 MB, with almost 2 Million lines.<\/p>\n<pre><code class=\"language-bash\">-rw-rw-r-- 1 austin austin 402M Mar  4 20:38 all_motor_vehicle_collision_data.csv\n\u2026\nbash &gt; wc -l all_motor_vehicle_collision_data.csv \n1972886 all_motor_vehicle_collision_data.csv\n<\/code><\/pre>\n<h3>Display the first five records of the dataset using <a href=\"https:\/\/ss64.com\/bash\/head.html\">head<\/a><\/h3>\n<pre><code class=\"language-bash\">bash &gt; head -n5 all_motor_vehicle_collision_data.csv \nCRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,NUMBER OF PERSONS INJURED,NUMBER OF PERSONS KILLED,NUMBER OF PEDESTRIANS INJURED,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST INJURED,NUMBER OF MOTORIST KILLED,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5\n09\/11\/2021,2:39,,,,,,WHITESTONE EXPRESSWAY,20 AVENUE,,2,0,0,0,0,0,2,0,Aggressive Driving\/Road Rage,Unspecified,,,,4455765,Sedan,Sedan,,,\n03\/26\/2022,11:45,,,,,,QUEENSBORO BRIDGE UPPER,,,1,0,0,0,0,0,1,0,Pavement Slippery,,,,,4513547,Sedan,,,,\n06\/29\/2022,6:55,,,,,,THROGS NECK BRIDGE,,,0,0,0,0,0,0,0,0,Following Too Closely,Unspecified,,,,4541903,Sedan,Pick-up Truck,,,\n09\/11\/2021,9:35,BROOKLYN,11208,40.667202,-73.8665,&quot;(40.667202, -73.8665)&quot;,,,1211      LORING AVENUE,0,0,0,0,0,0,0,0,Unspecified,,,,,4456314,Sedan,,,,\n<\/code><\/pre>\n<h3>Using <a href=\"https:\/\/ss64.com\/bash\/head.html\">head<\/a> to display the first record only<\/h3>\n<pre><code class=\"language-bash\">bash &gt; head -n1 all_motor_vehicle_collision_data.csv \nCRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,NUMBER OF PERSONS INJURED,NUMBER OF PERSONS KILLED,NUMBER OF PEDESTRIANS INJURED,NUMBER OF PEDESTRIANS KILLED,NUMBER OF CYCLIST INJURED,NUMBER OF CYCLIST KILLED,NUMBER OF MOTORIST INJURED,NUMBER OF MOTORIST KILLED,CONTRIBUTING FACTOR VEHICLE 1,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5\n<\/code><\/pre>\n<h3>Use Perl to list the column names in numerical order<\/h3>\n<pre><code class=\"language-perl\">bash &gt; perl -F, -an -E '$. == 1 &amp;&amp; say $i++ . &quot;\\t$_&quot; for @F'  all_motor_vehicle_collision_data.csv\n0\tCRASH DATE\n1\tCRASH TIME\n2\tBOROUGH\n3\tZIP CODE\n4\tLATITUDE\n5\tLONGITUDE\n6\tLOCATION\n7\tON STREET NAME\n8\tCROSS STREET NAME\n9\tOFF STREET NAME\n10\tNUMBER OF PERSONS INJURED\n11\tNUMBER OF PERSONS KILLED\n12\tNUMBER OF PEDESTRIANS INJURED\n13\tNUMBER OF PEDESTRIANS KILLED\n14\tNUMBER OF CYCLIST INJURED\n15\tNUMBER OF CYCLIST KILLED\n16\tNUMBER OF MOTORIST INJURED\n17\tNUMBER OF MOTORIST KILLED\n18\tCONTRIBUTING FACTOR VEHICLE 1\n19\tCONTRIBUTING FACTOR VEHICLE 2\n20\tCONTRIBUTING FACTOR VEHICLE 3\n21\tCONTRIBUTING FACTOR VEHICLE 4\n22\tCONTRIBUTING FACTOR VEHICLE 5\n23\tCOLLISION_ID\n24\tVEHICLE TYPE CODE 1\n25\tVEHICLE TYPE CODE 2\n26\tVEHICLE TYPE CODE 3\n27\tVEHICLE TYPE CODE 4\n28\tVEHICLE TYPE CODE 5\n<\/code><\/pre>\n<h6>Explanation<\/h6>\n<ul>\n<li><code>'perl -an -E'<\/code>\n<ul>\n<li>Split up the column values into array <code>'@F'<\/code><\/li>\n<\/ul>\n<\/li>\n<li><code>'-F,'<\/code>\n<ul>\n<li>Specifies a <strong>comma<\/strong> field separator.<\/li>\n<\/ul>\n<\/li>\n<li><code>'$. == 1'<\/code>\n<ul>\n<li>The Perl special variable <code>'$.'<\/code> contains the current line number.<\/li>\n<li>Display the first line only.<\/li>\n<\/ul>\n<\/li>\n<li><code>'say $i++ . &quot;\\t$_&quot; for @F'<\/code>\n<ul>\n<li>Prints a tab separated counter variable <code>'$i'<\/code>, and the corresponding column name, stored in the Perl default variable <code>'$_'<\/code>.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Create a work-file containing the zip-code, injury count, and fatality count. Only get records that include a zip-code and at least one injury or fatality.<\/h3>\n<pre><code>3\tZIP CODE\n10\tNUMBER OF PERSONS INJURED\n11\tNUMBER OF PERSONS KILLED\n<\/code><\/pre>\n<h5>Method<\/h5>\n<ul>\n<li>Unfortunately the previous method for splitting a comma delimited file has limitations. It cannot handle fields with embedded commas. The <em>Street Name<\/em> fields often have embedded commas which will throw off our column numbering.<\/li>\n<li>To get around this we can use <a href=\"https:\/\/metacpan.org\/pod\/Text::CSV\">Text::CSV<\/a>, which has both functional and OO interfaces.  For one-liners, it exports a handy <a href=\"https:\/\/metacpan.org\/pod\/Text::CSV#csv\">csv function<\/a>. From the Text::CSV documentation <code>'my $aoa = csv (in =&gt; &quot;test.csv&quot;) or die Text::CSV_XS-&gt;error_diag;''<\/code>, it\u2019ll convert the CSV file into an array of arrays.<\/li>\n<li>I&#8217;ll modify this example slightly to <code>'csv( in =&gt; $ARGV[0], headers =&gt; qq\/skip\/ )'<\/code>. The @ARGV array contains any input arguments. The first element $ARGV[0] will contain the input CSV file. We don\u2019t need the header row, so it\u2019ll be skipped.<\/li>\n<\/ul>\n<pre><code class=\"language-perl\">perl -MText::CSV=csv  -E '$aofa = csv( in =&gt; $ARGV[0], headers =&gt; qq\/skip\/ ); ( $_-&gt;[3] =~ \/^\\S+$\/ ) &amp;&amp; say qq\/$_-&gt;[3],$_-&gt;[10],$_-&gt;[11]\/ for @{$aofa}'  all_motor_vehicle_collision_data.csv | sort -t, -k 1 -r &gt; sorted_injured_killed_by_zip.csv\n<\/code><\/pre>\n<h6>Explanation<\/h6>\n<ul>\n<li>Input file <code>'all_motor_vehicle_collision_data.csv'<\/code><\/li>\n<li><code>'perl -MText::CSV=csv'<\/code>\n<ul>\n<li>Run the perl command with <code>'-M'<\/code> switch to load a Perl module.<\/li>\n<\/ul>\n<\/li>\n<li><code>'Text::CSV=csv'<\/code>\n<ul>\n<li>Export the \u2018csv\u2019 function from the <code>'Text::CSV'<\/code> module.<\/li>\n<\/ul>\n<\/li>\n<li><code>'( $_-&gt;[3] =~ \/^\\S+$\/ )'<\/code>\n<ul>\n<li>Use a Regular expression to only process rows that have non-blank data in the <em>ZIP CODE<\/em> field.<\/li>\n<\/ul>\n<\/li>\n<li><code>'say qq\/$_-&gt;[3],$_-&gt;[10],$_-&gt;[11]\/ for @{$aofa}'<\/code>\n<ul>\n<li>Loop through the Array of Arrays <code>'$aofa'<\/code><\/li>\n<li>Print the contents of columns 3,10,11 followed by a line break.<\/li>\n<\/ul>\n<\/li>\n<li>The output is piped <code>'|'<\/code> into the Linux sort command.\n<ul>\n<li>Sorting on the first field, <em>ZIP CODE<\/em> and redirecting, <code>'&gt;'<\/code> into a new file, <code>'sorted_injured_killed_by_zip.csv'<\/code>.<\/li>\n<li>See the <a href=\"https:\/\/ss64.com\/bash\/sort.html\">ss64.com site<\/a> for more details on the Linux sort command.<\/li>\n<\/ul>\n<\/li>\n<li>The new file has about 1.36 Million lines.<\/li>\n<\/ul>\n<h4>Get a line count with <a href=\"https:\/\/ss64.com\/bash\/wc.html\">wc<\/a>. Display the first 10 records using <a href=\"https:\/\/ss64.com\/bash\/head.html\">head<\/a><\/h4>\n<pre><code class=\"language-bash\">bash &gt; wc -l sorted_injured_killed_by_zip.csv \n1359291 sorted_injured_killed_by_zip.csv\nbash &gt; head -n10 sorted_injured_killed_by_zip.csv | column -t -s, --table-columns=ZipCode,#Injured,#Killed\nZipCode  #Injured  #Killed\n11697    4         0\n11697    3         0\n11697    2         0\n11697    2         0\n11697    2         0\n11697    1         0\n11697    1         0\n11697    1         0\n11697    1         0\n11697    1         0\n<\/code><\/pre>\n<h6>Explanation<\/h6>\n<ul>\n<li><code>'wc -l'<\/code>\n<ul>\n<li>Counts the number of lines in our new file<\/li>\n<\/ul>\n<\/li>\n<li><code>'head -n 10'<\/code>\n<ul>\n<li>Prints out the first 10 lines of the file<\/li>\n<\/ul>\n<\/li>\n<li><code>'column -t -s, --table-columns=ZipCode,#Injured,#Killed'<\/code>\n<ul>\n<li><a href=\"https:\/\/www.man7.org\/linux\/man-pages\/man1\/column.1.html\">column<\/a><\/li>\n<li><code>'-t'<\/code> switch will tell <code>'column'<\/code> to print in table format.<\/li>\n<li><code>'-s'<\/code> switch specifies an input delimiter of &#8216;,&#8217;.<\/li>\n<li>The output is tabbed.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>List the 10 worst zip codes for injuries<\/h3>\n<h4>We can use the output file, sorted_injured_killed_by_zip.csv, from the previous example,<\/h4>\n<pre><code class=\"language-perl\">perl -n -E '@a=split(q\/,\/,$_);$h{$a[0]} += $a[1]; END{say qq\/$_,$h{$_}\/ for keys %h}' sorted_injured_killed_by_zip.csv  | sort -nr -t, -k 2  | head -n10 | column -t -s, --table-columns=ZipCode,#Injured\nZipCode  #Injured\n11207    10089\n11236    7472\n11203    7426\n11212    6676\n11226    6103\n11208    6027\n11234    5505\n11434    5403\n11233    5159\n11385    4440\n<\/code><\/pre>\n<h6>Explanation<\/h6>\n<ul>\n<li><code>'@a=split(q\/,\/,$_);'<\/code>\n<ul>\n<li>As there are no embedded commas in this file we use the Perl \u2018split\u2019 function to break up the 3 CSV fields in each row into array <code>'@a'<\/code>.<\/li>\n<\/ul>\n<\/li>\n<li><code>'$h{$a[0]} += $a[1];'<\/code>\n<ul>\n<li>The first element of each row, <em>ZIP CODE<\/em> is used as a key for Hash&#8217;<code>%h'<\/code>.<\/li>\n<li>The value is the accumulated number of injuries for that <em>ZIP CODE<\/em>.<\/li>\n<\/ul>\n<\/li>\n<li><code>'$h{$a[0]} += $a[1]'<\/code>\n<ul>\n<li>We accumulate the second element, $[1], which contains <code>'NUMBER OF PERSONS INJURED'<\/code><\/li>\n<li>We can set a value for a Hash key without checking if it exists already.<\/li>\n<li>This is called Autovivification which is explained nicely by <a href=\"https:\/\/perlmaven.com\/autovivification\">The Perl Maven<\/a>.<\/li>\n<\/ul>\n<\/li>\n<li><code>'END{say qq\/$_,$h{$_}\/ for keys %h}'<\/code>\n<ul>\n<li>The <code>'END{}<\/code>&#8216;block runs after all the rows are processed.<\/li>\n<li>The keys(Zip Codes) are read and printed along with their corresponding values.<\/li>\n<li>We could have used Perl to sort the output by the keys, or values.\n<ul>\n<li>I used the Linux <a href=\"https:\/\/ss64.com\/bash\/sort.html\">sort<\/a>.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li><code>'sort -nr -t, -k 2'<\/code>\n<ul>\n<li>Will perform a numeric sort, descending on the # of people injured.<\/li>\n<\/ul>\n<\/li>\n<li><code>'head -n10'<\/code>\n<ul>\n<li>Will get the first 10 records printed.<\/li>\n<\/ul>\n<\/li>\n<li><code>'column -t -s, --table-columns=ZipCode,#Injured'<\/code><\/li>\n<li><code>The <\/code>&#8216;columns&#8217;` command will produce a prettier output.\n<ul>\n<li><code>'-t'<\/code> for table format.<\/li>\n<li><code>'-s'<\/code> to specify that the fields are comma separated<\/li>\n<li><code>'--table-columns''<\/code> to add column header names.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h6>Observation<\/h6>\n<p>Zip code <a href=\"http:\/\/www.neighborhoodlink.com\/zip\/11207\">11207<\/a>, which encompasses East New York, Brooklyn, as well as a small portion of Southern Queens, has a lot of issues with traffic safety.<\/p>\n<h3>Display the 10 worst zip codes for traffic fatalities<\/h3>\n<pre><code class=\"language-perl\">bash &gt; perl -n -E '@a=split(q\/,\/,$_);$h{$a[0]} += $a[2]; END{say qq\/$_,$h{$_}\/ for keys %h}' sorted_injured_killed_by_zip.csv  | sort -nr -t, -k 2  | head -n10 | column -t -s, --table-columns=ZipCode,#Killed\nZipCode  #Killed\n11236    44\n11207    34\n11234    29\n11434    25\n11354    25\n11229    24\n11208    24\n11206    23\n11233    22\n11235    21\n<\/code><\/pre>\n<h6>Explanation<\/h6>\n<ul>\n<li>With a few minor adjustments, we got the worst zip codes for traffic collision fatalities<\/li>\n<li><code>'$h{$a[0]} += $a[2]'<\/code>\n<ul>\n<li>Accumulate the third element, $[2], which contains <code>'NUMBER OF PERSONS KILLED'<\/code><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h6>Observation<\/h6>\n<ul>\n<li>Zip code <a href=\"http:\/\/www.neighborhoodlink.com\/zip\/11236\">11236<\/a>, which includes Canarsie Brooklyn is the worst for traffic fatalities according to this data.\n<ul>\n<li>Zip code <strong>11207<\/strong> is also very bad for traffic fatalities, as well as being the worst for collision injuries<\/li>\n<\/ul>\n<\/li>\n<li>These stats are not 100 percent correct, as out of 1,972,886 collision records, 1,359,291 contained Zip codes.\n<ul>\n<li>We have 613,595 records with no zip code, which were not included in the calculations.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h3>Some NYC Borough Stats<\/h3>\n<p>Similar to how we created the <code>'sorted_injured_killed_by_zip.csv'<\/code>, we can run the following command sequence to create a new file <code>'sorted_injured_killed_by_borough.csv'<\/code><\/p>\n<pre><code class=\"language-perl\">perl -MText::CSV=csv  -E '$aofa = csv( in =&gt; $ARGV[0], headers =&gt; qq\/skip\/ ) ; ( $_-&gt;[2] =~ \/^\\S+\/ ) &amp;&amp; say qq\/$_-&gt;[2],$_-&gt;[10],$_-&gt;[11]\/ for @{$aofa}'  all_motor_vehicle_collision_data.csv | sort  -t, -k 3rn -k 2rn -k 1  &gt;|  sorted_injured_killed_by_borough.csv\n<\/code><\/pre>\n<h6>Explanation<\/h6>\n<ul>\n<li>The Borough field is the third column, <code>'2\tBOROUGH'<\/code>, starting from 0, in the <code>'all_motor_vehicle_collision_data.csv'<\/code> file.<\/li>\n<li><code>'( $_-&gt;[2] =~ \/^\\S+\/ )'<\/code>\n<ul>\n<li>Only get rows which have non blank data in the <em>BOROUGH<\/em> field.<\/li>\n<\/ul>\n<\/li>\n<li><code>'sort -t, -k 3rn -k 2rn -k 1'<\/code>\n<ul>\n<li>I added some more precise sorting, which is unnecessary except to satisfy my curiosity.\n<ul>\n<li><a href=\"https:\/\/ss64.com\/bash\/sort.html\">sort<\/a><\/li>\n<\/ul>\n<\/li>\n<li><code>'-k 3rn<\/code>&#8221;\n<ul>\n<li>Sort by column 3(starting @ 1), which is the fatality count field.<\/li>\n<li>This is sorted numerically in descending order.<\/li>\n<\/ul>\n<\/li>\n<li><code>'-k 2rn<\/code>&#8221;\n<ul>\n<li>When equal, the injury count is also sorted numerically, descending.<\/li>\n<\/ul>\n<\/li>\n<li><code>'-k 1'<\/code>\n<ul>\n<li>The Borough is sorted in ascending order as a tiebreaker.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h4>Display the first 10 rows of this file.<\/h4>\n<pre><code class=\"language-bash\">bash &gt; head -n10 sorted_injured_killed_by_borough.csv | column -t -s, --table-columns=Borough,#Injured,#Killed\nBorough        #Injured  #Killed\nMANHATTAN      12        8\nQUEENS         3         5\nQUEENS         15        4\nQUEENS         1         4\nSTATEN ISLAND  6         3\nBROOKLYN       4         3\nBROOKLYN       3         3\nQUEENS         3         3\nBROOKLYN       1         3\nQUEENS         1         3\n<\/code><\/pre>\n<h3>Sanity check if we got all five boroughs<\/h3>\n<pre><code class=\"language-bash\">cut -d, -f 1  sorted_injured_killed_by_borough.csv | sort -u \nBRONX\nBROOKLYN\nMANHATTAN\nQUEENS\nSTATEN ISLAND\n<\/code><\/pre>\n<h6>Explanation<\/h6>\n<ul>\n<li><code>'cut -d, -f 1'<\/code>\n<ul>\n<li><a href=\"https:\/\/ss64.com\/bash\/cut.html\">cut<\/a> to split the comma delimited file records.<\/li>\n<li><code>'-d,'<\/code>\n<ul>\n<li>Specifies that the cut will comma delimited<\/li>\n<\/ul>\n<\/li>\n<li><code>'-f 1'<\/code>\n<ul>\n<li>Get the first field from the <code>cut<\/code>, which is the Borough Name.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li><code>'sort -u'<\/code>\n<ul>\n<li>Sorts and prints only the unique values to STDOUT<\/li>\n<\/ul>\n<\/li>\n<li>We got all 5 New York City boroughs in this file.<\/li>\n<\/ul>\n<h3>Display collision injuries for each borough<\/h3>\n<pre><code class=\"language-perl\">bash &gt; perl -n -E '@a=split(q\/,\/,$_);$h{$a[0]} += $a[1]; END{say qq\/$_,$h{$_}\/ for keys %h}' sorted_injured_killed_by_borough.csv   | sort -nr -t, -k 2 | column -t -s,\nBROOKLYN       137042\nQUEENS         105045\nBRONX          62880\nMANHATTAN      61400\nSTATEN ISLAND  15659\n<\/code><\/pre>\n<h6>Observation<\/h6>\n<ul>\n<li>Brooklyn emerges as the Borough with the most traffic injuries.<\/li>\n<\/ul>\n<h3>Display collision fatalities by Borough<\/h3>\n<pre><code class=\"language-perl\">bash &gt; perl -n -E '@a=split(q\/,\/,$_);$h{$a[0]} += $a[2]; END{say qq\/$_,$h{$_}\/ for keys %h}' sorted_injured_killed_by_borough.csv   | sort -nr -t, -k 2 | column  -J -s, --table-columns Borough,#Killed\n{\n   &quot;table&quot;: [\n      {\n         &quot;borough&quot;: &quot;BROOKLYN&quot;,\n         &quot;#killed&quot;: &quot;564&quot;\n      },{\n         &quot;borough&quot;: &quot;QUEENS&quot;,\n         &quot;#killed&quot;: &quot;482&quot;\n      },{\n         &quot;borough&quot;: &quot;MANHATTAN&quot;,\n         &quot;#killed&quot;: &quot;300&quot;\n      },{\n         &quot;borough&quot;: &quot;BRONX&quot;,\n         &quot;#killed&quot;: &quot;241&quot;\n      },{\n         &quot;borough&quot;: &quot;STATEN ISLAND&quot;,\n         &quot;#killed&quot;: &quot;88&quot;\n      }\n   ]\n}\n<\/code><\/pre>\n<h6>Explanation<\/h6>\n<ul>\n<li>Similar to the Injury count by Borough, this counts all fatalities by borough and prints the output in JSON format.<\/li>\n<li><code>'column -J -s, --table-columns Borough,#Killed'<\/code>  Use the <code>'column'<\/code> command with the <code>'-J'<\/code> switch, for JSON, instead of <code>'-t'<\/code> for Table.<\/li>\n<\/ul>\n<h3>I forgot to mention what date range is involved with this dataset. We can check this with the <a href=\"https:\/\/ss64.com\/bash\/cut.html\">cut<\/a> command.<\/h3>\n<pre><code class=\"language-bash\">cut -d, -f1 all_motor_vehicle_collision_data.csv | cut -d\/ -f3 | sort -u\n2012\n2013\n2014\n2015\n2016\n2017\n2018\n2019\n2020\n2021\n2022\n2023\n<\/code><\/pre>\n<h6>Explanation<\/h6>\n<ul>\n<li>Get the date field, <code>'0\tCRASH DATE'<\/code> which is in <code>'mm\/dd\/yyyy'<\/code> format.<\/li>\n<li><code>'cut -d, -f '<\/code> <code>'all_motor_vehicle_collision_data.csv'<\/code>\n<ul>\n<li>Get` the first column\/field of data for every row of this CSV file.<\/li>\n<li><code>'-d,'<\/code> specifies that we are cutting on the comma delimiters.<\/li>\n<li><code>'-f 1'<\/code> specifies that we want the first column\/field only\n<ul>\n<li>This is the date in <code>'mm\/dd\/yyyy'<\/code> format.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li><code>'cut -d\/ -f3'<\/code>\n<ul>\n<li>Will cut the date using <code>'\/<\/code>\u2019 as the delimiter.<\/li>\n<li>Grab the third field from this, which is the four digit year.<\/li>\n<\/ul>\n<\/li>\n<li><code>'sort -u'<\/code>\n<ul>\n<li>The years are then sorted with duplicates removed.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h6>Observation<\/h6>\n<ul>\n<li>The dataset started sometime in 2012 and continues until now, March 2023.<\/li>\n<\/ul>\n<h3>Display the 20 worst days for collisions in NYC<\/h3>\n<pre><code class=\"language-bash\">bash &gt; cut -d, -f1 all_motor_vehicle_collision_data.csv | awk -F '\/' '{print $3 &quot;-&quot; $1 &quot;-&quot; $2}' | sort | uniq -c | sort -k 1nr | head -n20 | column -t --table-columns=#Collisions,Date\n#Collisions  Date\n1161        2014-01-21\n1065        2018-11-15\n999         2017-12-15\n974         2017-05-19\n961         2015-01-18\n960         2014-02-03\n939         2015-03-06\n911         2017-05-18\n896         2017-01-07\n884         2018-03-02\n883         2017-12-14\n872         2016-09-30\n867         2013-11-26\n867         2018-11-09\n857         2017-04-28\n851         2013-03-08\n851         2016-10-21\n845         2017-06-22\n845         2018-06-29\n841         2018-12-14\n<\/code><\/pre>\n<h6>Explanation<\/h6>\n<ul>\n<li>Get a count for all collisions for each date on record<\/li>\n<li>Display the first 20 with the highest collision count<\/li>\n<li><a href=\"https:\/\/ss64.com\/bash\/cut.html\">cut<\/a>\n<ul>\n<li>Get the first column from the dataset.<\/li>\n<li>Pipe this date into the <a href=\"https:\/\/ss64.com\/bash\/awk.html\">awk<\/a> command.<\/li>\n<li>AWK is a very useful one-liner tool as well as being a full scripting language.<\/li>\n<\/ul>\n<\/li>\n<li><code>'awk -F '\/' '{print $3 &quot;-&quot; $1 &quot;-&quot; $2}'<\/code>\n<ul>\n<li><code>'-F '\/' '<\/code>\n<ul>\n<li>Split the date into separate fields using the <code>\u2019\/\u2019<\/code> as a delimiter.<\/li>\n<li>$1 contains the month value, $2 contains the day of month and $3 contains the four digit year value.<\/li>\n<li>These will be printed in the format <code>'\u201dyyyy-mm-dd\u201d<\/code>.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Dates are then sorted and piped into the <a href=\"https:\/\/ss64.com\/bash\/uniq.html\">uniq<\/a> command.<\/li>\n<li><code>'uniq -c'<\/code>\n<ul>\n<li>Will create a unique output.<\/li>\n<li><code>'-c'<\/code> switch gets a count of all the occurrences for each value.<\/li>\n<\/ul>\n<\/li>\n<li>The output is piped into another sort command, which sorts by the number of occurrences descending.<\/li>\n<\/ul>\n<h6>Observation<\/h6>\n<ul>\n<li>I\u2019m not sure if there is any explanation for why some days have a lot more collisions than others. Weatherwise, January 21 2014 was a cold day, but otherwise uneventful. November 15 2018 had some snow, but not a horrific snowfall. The clocks went back on November 4, so that wouldn\u2019t be a factor.<\/li>\n<li><a href=\"https:\/\/www.wunderground.com\/history\/daily\/us\/ny\/new-york-city\/KLGA\/date\/2014-1-21\">2014-01-21 weather <\/a><\/li>\n<li><a href=\"https:\/\/www.wunderground.com\/history\/daily\/us\/ny\/new-york-city\/KLGA\/date\/2018-11-15\">2018-11-15 weather<\/a><\/li>\n<\/ul>\n<h3>Display the twenty worst times during the day for collisions<\/h3>\n<pre><code class=\"language-bash\">bash &gt; cut -d, -f2 all_motor_vehicle_collision_data.csv | sort | uniq -c | sort -k 1nr | head -n20 |column -t  --table-columns=#Collisions,Time\n#Collisions  Time\n27506       16:00\n26940       17:00\n26879       15:00\n24928       18:00\n24667       14:00\n22914       13:00\n20687       9:00\n20641       12:00\n20636       19:00\n19865       16:30\n19264       8:00\n19107       10:00\n19106       14:30\n19010       0:00\n18691       11:00\n18688       17:30\n16646       18:30\n16602       20:00\n16144       8:30\n16008       13:30\n<\/code><\/pre>\n<h6>Explanation<\/h6>\n<ul>\n<li>We use the time field, <code>'1\tCRASH TIME'<\/code>, which is in 24 hour format, &#8216;HH:MM&#8217;<\/li>\n<\/ul>\n<h6>Observation<\/h6>\n<ul>\n<li>Using the &#8216;actual time&#8217; may be a bit too specific. I\u2019ll use the &#8216;hour of day&#8217; instead to give a 60 minute time frame.<\/li>\n<\/ul>\n<pre><code class=\"language-bash\"> bash &gt; cut -d, -f2 all_motor_vehicle_collision_data.csv | cut -d : -f1 | sort | uniq -c | sort -k 1nr | head -n10 | column -t --table-columns=#Collisions,Hour\n#Collisions  Hour\n143012      16\n139818      17\n132443      14\n123761      15\n122971      18\n114555      13\n108925      12\n108593      8\n105206      9\n102541      11\n<\/code><\/pre>\n<h6>Explanation<\/h6>\n<ul>\n<li>Similar to the previous example, except this time the <a href=\"https:\/\/ss64.com\/bash\/cut.html\">cut<\/a> command is used to split the time HH:MM, delimited by <code>':'<\/code><\/li>\n<li><code>'cut -d : -f 1'<\/code>\n<ul>\n<li><code>'-d'<\/code>\n<ul>\n<li>The &#8216;cut&#8217; delimiter is &#8216;:&#8217;<\/li>\n<\/ul>\n<\/li>\n<li><code>'-f 1'<\/code>\n<ul>\n<li>Grab the first field, &#8216;HH&#8217; of the &#8216;HH:MM&#8217;.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<li>Use something like the <a href=\"https:\/\/ss64.com\/bash\/printf.html\">printf<\/a> command to append <code>':00'<\/code> to those hours.<\/li>\n<\/ul>\n<h6>Observation<\/h6>\n<p>As you would expect, most collisions happen during rush hour.<\/p>\n<h3>Display the worst years for collisions<\/h3>\n<pre><code class=\"language-bash\">bash &gt; cut -d, -f1 all_motor_vehicle_collision_data.csv | cut -d '\/' -f3 | sort | uniq -c | sort -k 1nr | head -n10 | column -t --table-columns=#Collisions,Year\n#Collisions  Year\n231564       2018\n231007       2017\n229831       2016\n217694       2015\n211486       2019\n206033       2014\n203734       2013\n112915       2020\n110546       2021\n103745       2022\n<\/code><\/pre>\n<h6>Explanation<\/h6>\n<ul>\n<li>We use the first column, <code>'0\tCRASH DATE'<\/code> again<\/li>\n<li><code>'cut -d '\/' -f3'<\/code>\n<ul>\n<li>Extracts the <code>'yyyy'<\/code> from the <code>'mm\/dd\/yyyy'<\/code><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h6>Observation<\/h6>\n<ul>\n<li>Some improvement seen in 2020, 2021 and 2022, if you can believe the data.<\/li>\n<li>One unscientific observation here is that the NYPD may have been much less vigilant in the past few years than they were prior to Commissioner Raymond Kelly and his successors.<\/li>\n<li>Also, by only printing out the worst 10 years, partial years 2012 and 2023 were excluded.<\/li>\n<\/ul>\n<h3>For a comparison. See how many people were injured or killed in traffic collisions.<\/h3>\n<h6>First create a work file, <code>'sorted_injured_killed_by_year.csv'<\/code>, with three columns, Year, Injured count and Fatality count<\/h6>\n<p>We need the <a href=\"https:\/\/metacpan.org\/pod\/Text::CSV\">Text::CSV<\/a> Perl module here due to those embedded commas in earlier fields.  Below are the three fields needed.<\/p>\n<pre><code>0\tCRASH DATE\n10\tNUMBER OF PERSONS INJURED\n11\tNUMBER OF PERSONS KILLED\n<\/code><\/pre>\n<pre><code class=\"language-perl\">bash &gt; perl -MText::CSV=csv -E '$aofa = csv( in =&gt; $ARGV[0], headers =&gt; qq\/skip\/ ); ($_-&gt;[10] || $_-&gt;[11]) &amp;&amp; say substr($_-&gt;[0],6,4) . qq\/,$_-&gt;[10],$_-&gt;[11]\/ for @{$aofa}' all_motor_vehicle_collision_data.csv | sort &gt; sorted_injured_killed_by_year.csv\n<\/code><\/pre>\n<h5>Check out the work file we just created<\/h5>\n<pre><code class=\"language-bash\">bash &gt; wc -l sorted_injured_killed_by_year.csv \n433081 sorted_injured_killed_by_year.csv\n\nbash &gt; head -n10 sorted_injured_killed_by_year.csv | column -t -s, --table-columns=Year,#Injured,#Killed\nYear  #Injured  #Killed\n2012  0         1\n2012  0         1\n2012  0         1\n2012  0         1\n2012  0         1\n2012  0         1\n2012  0         1\n2012  0         1\n2012  0         1\n2012  0         1\n<\/code><\/pre>\n<h4>Worst years for collision injuries<\/h4>\n<pre><code class=\"language-perl\">perl -n -E '@a=split(q\/,\/,$_);$h{$a[0]} += $a[1]; END{say qq\/$_,  $h{$_}\/ for sort {$h{$b} &lt;=&gt; $h{$a} } keys %h}' sorted_injured_killed_by_year.csv | head -n10 |  column  -t -s, --table-columns=Year,#Injured \nYear  #Injured\n2018    61941\n2019    61389\n2017    60656\n2016    60317\n2013    55124\n2022    51883\n2021    51780\n2015    51358\n2014    51223\n2020    44615\n<\/code><\/pre>\n<h6>Explanation<\/h6>\n<ul>\n<li>This is similar to how we got the Zip Code and Borough data previously.<\/li>\n<li>This time the Perl sort is used instead of the Linux sort.<\/li>\n<li><code>'END{say qq\/$_,  $h{$_}\/ for sort {$h{$b} &lt;=&gt; $h{$a} } keys %h}'<\/code>\n<ul>\n<li><code>'for'<\/code> statement loops through the <code>'%h'<\/code> hash keys(years).<\/li>\n<li>The corresponding Hash values(Injured count), are sorted in descending order.<\/li>\n<\/ul>\n<\/li>\n<li><code>'sort {$h{$b} &lt;=&gt; $h{$a} }'<\/code>.\n<ul>\n<li>$a and $b are default Perl sort variables.<\/li>\n<li>Rearranged it to <code>'sort {$h{$a} &lt;=&gt; $h{$b} }'<\/code>, to sort the injury count in ascending order.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<h6>Observation<\/h6>\n<p>While the collision count may have gone down, there isn&#8217;t any real corresponding downward trend in injuries.<\/p>\n<h4>The worst years for collision fatalities.<\/h4>\n<pre><code class=\"language-perl\">bash &gt; perl -n -E '@a=split(q\/,\/,$_);$h{$a[0]} += $a[2]; END{say qq\/$_,  $h{$_}\/ for sort {$h{$b} &lt;=&gt; $h{$a} } keys %h}' sorted_injured_killed_by_year.csv | head -n10 |  column  -t -s, --table-columns=Year,#Killed\nYear  #Killed\n2013    297\n2021    294\n2022    285\n2020    268\n2014    262\n2017    256\n2016    246\n2019    244\n2015    243\n2018    231\n<\/code><\/pre>\n<h6>Explanation<\/h6>\n<ul>\n<li>Slightly modified version of the injury by year count.<\/li>\n<\/ul>\n<h6>Observation<\/h6>\n<p>Same as with the injuries count. There isn&#8217;t any real corresponding downward trend in traffic collision fatalities.<\/p>\n<h3>Conclusion<\/h3>\n<p>There\u2019s lots more work that can be done to extract meaningful information from this dataset.<br>\nWhat\u2019s clear to me, is that all the political rhetoric and money poured into <a href=\"https:\/\/www.nyc.gov\/content\/visionzero\/pages\/\">Vision Zero<\/a> has yielded little in terms of results.\nMost of the solutions are obvious from a logical point of view, but not a political point of view. I walk and cycle these streets and know how dangerous it is to cross at the \u201cdesignated\u201d crosswalks when cars and trucks are turning in on top of you. Cycling in NYC is even worse.<\/p>\n<h4>Some sugggested solutions<\/h4>\n<ul>\n<li>Delayed green lights co cars cars don&#8217;t turn in on pedestrians at crosswalks.<\/li>\n<li>Much higher tax and registraton fees for giant SUV&#8217;s and pickup trucks. The don\u2019t belong in the city.<\/li>\n<li>Better bike lanes, instead of meaningless lines painted on the road.\n<ul>\n<li>Many bike lanes are used as convenient parking for NYPD and delivery vehicles.<\/li>\n<\/ul>\n<\/li>\n<li>Basic enforcement of traffic laws, which isn&#8217;t being done now.\n<ul>\n<li>Drivers ignore red lights, speed limits, noise restrictions etc. when they know they aren&#8217;t being enforced.<\/li>\n<li>Driving while texting or yapping on the phone is the norm, not the exception.<\/li>\n<\/ul>\n<\/li>\n<li>Drastically improve public transit, especially in areas not served by the subway system.<\/li>\n<\/ul>\n<h3>Some Perl CLI Resources<\/h3>\n<p><a href=\"https:\/\/perldoc.perl.org\/perlrun\">Perldocs &#8211; perlrun<\/a><\/p>\n<p><a href=\"https:\/\/catonmat.net\/books\">Peteris Krumins has some great e-books<\/a><\/p>\n<p><a href=\"https:\/\/www.perl.com\/pub\/2004\/08\/09\/commandline.html\/\">Dave Cross &#8211; From one of his older posts on perl.com<\/a><\/p>\n<h3>Some NYC Street Resources<\/h3>\n<p><a href=\"https:\/\/nyc.streetsblog.org\/\">StreetsBlog NYC<\/a><\/p>\n<p><a href=\"https:\/\/hellgatenyc.com\/\">Hellgate NYC &#8211; Local NYC News<\/a><\/p>\n<p><a href=\"https:\/\/elkue.com\/\">Liam Quigley &#8211; Local Reporter<\/a><\/p>\n<p><a href=\"https:\/\/twitter.com\/_elkue\">More Liam Quigley &#8211; Twitter<\/a><\/p>\n<p><a href=\"https:\/\/www.youtube.com\/watch?v=jN7mSXMruEo\">These Stupid Trucks are Literally Killing Us &#8211; YouTube<\/a><\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[111,113,105,114,110,112,101],"tags":[121,122,115,49,133,20,128,117,132],"class_list":["post-907","post","type-post","status-publish","format-standard","hentry","category-bash","category-cli","category-csv","category-json","category-linux","category-oneliner","category-perl","tag-awk","tag-cli","tag-linux","tag-newyorkcity","tag-oneliner","tag-perl","tag-sort","tag-traffic","tag-trafficsafety"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/www.aibistin.com\/index.php?rest_route=\/wp\/v2\/posts\/907","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.aibistin.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aibistin.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aibistin.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aibistin.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=907"}],"version-history":[{"count":29,"href":"https:\/\/www.aibistin.com\/index.php?rest_route=\/wp\/v2\/posts\/907\/revisions"}],"predecessor-version":[{"id":1043,"href":"https:\/\/www.aibistin.com\/index.php?rest_route=\/wp\/v2\/posts\/907\/revisions\/1043"}],"wp:attachment":[{"href":"https:\/\/www.aibistin.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=907"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aibistin.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=907"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.aibistin.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=907"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}