If you’ve already read my first post about awk, thanks for reading this one too! If not, that’s probably a better place to start.

Let’s continue where we left off in the last post. We were looking at one in every 500 rows to get a better picture of the data. For a reminder, here are the headers:

# In this file, the second row has headers. Pattern match the second row
# Recall, the default action is to print the line
$ awk 'NR == 2' education.csv
Region/Country/Area,,Year,Series,Value,Footnotes,Source

and here is the output if we look at one in every 500 lines:

# Pattern match all line numbers which are multiples of 500
$ awk 'NR % 500 == 0' education.csv
4,Afghanistan,2004,Gross enrollment ratio - Primary (female),65.0020,,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."
48,Bahrain,2005,Gross enrollment ratio - Tertiary (female),39.8646,,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."
76,Brazil,2009,Students enrolled in primary education (thousands),"17,451.8860",,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."
148,Chad,2005,Students enrolled in tertiary education (thousands),12.3730,,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."
191,Croatia,2015,Students enrolled in secondary education (thousands),360.2050,,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."
222,El Salvador,2010,Students enrolled in primary education (thousands),939.7260,,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."
276,Germany,2015,Students enrolled in tertiary education (thousands),"2,977.7810",,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."
360,Indonesia,2014,Students enrolled in primary education (thousands),"29,838.4400",,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."
414,Kuwait,2015,Gross enrollment ratio - Secondary (male),93.9781,,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."
454,Malawi,2010,Gross enrollement ratio - Primary (male),131.4923,,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."
500,Montserrat,2007,Gross enrollment ratio - Secondary (male),101.0870,Estimate.,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."
807,North Macedonia,2010,Students enrolled in secondary education (thousands),196.9910,,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."
630,Puerto Rico,2015,Gross enrollment ratio - Secondary (male),83.8889,,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."
682,Saudi Arabia,2009,Students enrolled in primary education (thousands),"3,255.2430",,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."
275,State of Palestine,2015,Gross enrollment ratio - Primary (female),94.3882,,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."
788,Tunisia,2011,Gross enrollment ratio - Tertiary (male),26.7459,,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."
862,Venezuela (Boliv. Rep. of),2014,Gross enrollement ratio - Primary (male),101.5835,,"United Nations Educational, Scientific and Cultural Organization (UNESCO), Montreal, the UNESCO Institute for Statistics (UIS) statistics database, last accessed March 2019."

So many different things are happening in this data file! We have data

  1. About different countries, not just aggregates
  2. From many different years
  3. Broken down by males / females
  4. Broken down by primary, secondary, tertiary

Let’s dive into each part! We can first look at all of the unique fields within the second column, and see exactly what is happening with all of the totals and countries.

Awk Arrays

In order to do this, we’re going to need to learn about arrays. Arrays in awk are most similar to maps or hashes in other languages. They store values at indexes. But, these indexes can be strings or integers. This means we get a key-value lookup from arrays. We don’t need to declare our arrays in any special way, we can just start putting values into them.

To start, let’s get the count of each key in the second column. We can do the following:

# FPAT gives us pattern matching for our fields, saying they're anything
# without a comma, or surrounded by double quotes
$ gawk 'BEGIN { FPAT = "[^,]*|\"[^\"]+\"" } \
NR > 2 { seen[$2]++ }
# Instructions which follow end execute after reading the file
END { for(country in seen) { print country, seen[country] } }' education.csv
Tuvalu 30
Malaysia 45
Micronesia (Fed. States of) 16
Eritrea 51
...

This program got a little complicated. Let’s look at what is new here:

  • { seen[$2]++ } is incrementing the value of seen at the key $2, which is our second field
  • for(country in seen) { print country, seen[country] } is iterating over our seen array, and printing all of the keys and values

Great. But this data super helpful. It’s telling us all of the countries we have, but it’s not giving us any information about their order, or importantly, what’s happening with the total fields.

For that, we can use a super slick awk snippet:

$ gawk 'BEGIN { FPAT = "[^,]*|\"[^\"]+\"" } \
NR > 2 && !seen[$2]++ { print $2 }' \
education.csv
"Total, all countries or areas"
Northern Africa
Sub-Saharan Africa
Northern America
Latin America & the Caribbean
Eastern Asia
South-eastern Asia
Southern Asia
Western Asia
Europe
Oceania
Afghanistan
Albania
Algeria
American Samoa
...
Yemen
Zambia
Zimbabwe

This output is exactly what we want! There’s one piece of that command which is new. We’re using !seen[$2]++ in the pattern section of our pattern { action } syntax. This means we’ll match anything where !seen[$2]++ is true. But… what does it mean for !seen[$2]++ to be true?

Let’s break it down. ++ is a post-increment. So it’ll increment the value of seen[$2] by 1. But it won’t actually change the value of !seen[$2] when we evaluate its boolean value. seen[$2] will not have been set the first time we try to access it. In awk, this means we’ll get a value of "". And !"" in awk is true. So the first time we see a key, the pattern will evaluate to true, it will match! But, every subsequent time we see that key, we’ll have a value in seen[key], starting at 1, and incrementing to 2, and so on. In awk, !1 is false. The pattern won’t match in the second, or any subsequent call.

This means we’ll print out the second column the first time we see a value (with our { print $2 }), and never again for that same value. Our output is therefore all the distinct values in the second column, ordered by their first appearance. Critical to our understanding of the data, we can see that there are a few regions, and then countries alphabetized from Afghanistan to Zimbabwe.

If we figure out which row number Afghanistan starts on, we can just look at data from that row number onwards, and ignore the data by region. We can use what we just learned above to print the row number on the first instance of Afghanistan

$ awk '/Afghanistan/ && !seen++ { print NR }' education.csv
498

/Afghanistan/ says look for a row which matches the pattern (eg contains the word) Afghanistan, and our nifty !seen++ will ensure we only print the first time we see such a row.

Let’s just double check we’re not missing any countries before Afghanistan by printing all of the unique values in the second column between rows 2 and 498:

$ gawk 'BEGIN { FPAT = "[^,]*|\"[^\"]+\"" }
NR > 2 && NR < 498 && !seen[$2]++ { print $2 }' \
education.csv
"Total, all countries or areas"
Northern Africa
Sub-Saharan Africa
Northern America
Latin America & the Caribbean
Eastern Asia
South-eastern Asia
Southern Asia
Western Asia
Europe
Oceania

Perfect - no countries. Okay, so we’re starting on line 498.

The next thing we noticed about the data is the years are all different. Let’s take a closer look, using what we just learned.

Awk for loops

First, which field is year? If you’ll remember, the second row of this file is the actual header row:

$ awk 'NR == 2' education.csv
Region/Country/Area,,Year,Series,Value,Footnotes,Source

We can see that Year is the 3rd column. But, we’re trying to learn, so let’s figure out a way to get that information programatically. NF will tell us the number of fields in a record. We can construct a little for loop to figure out the field which matches Year. The syntax is similar to many languages:

$ gawk 'BEGIN { FPAT = "[^,]*|\"[^\"]+\"" }
NR == 2 {
  # Iterate over all of the fields
  for(field_number = 1; field_number < NF; field_number++) {
    # Select the field which has a value of "Year"
    if ($field_number == "Year") { print field_number }
  }
}' education.csv
3

Confirmed, year is the 3rd field! Let’s grab the distinct years in our file again using our !seen[$3]++ magic:

$ gawk 'BEGIN { FPAT = "[^,]*|\"[^\"]+\"" }
NR >= 498 && !seen[$3]++ { print $3 }'  education.csv
2004
2005
2009
2010
2014
2015
2017
...

Not super useful. We actually just want to find a recent year that has a lot of data. We can do this by getting the total counts of how many times each year appears:

$ gawk 'BEGIN { FPAT = "[^,]*|\"[^\"]+\"" }
NR >= 498 { years[$3]++ } END
{ for(year in years) print year, years[year] }'  education.csv
2000 65
2001 48
2002 94
2003 107
2004 206
2005 1334
2006 76
2007 73
2008 91
2009 225
2010 1315
2011 98
2012 188
2013 202
2014 1299
2015 1301
2016 633
2017 740
2018 38

Okay, looks like 2015 is a recent year with a whole lot of data. From now on, we’ll make sure we’re also matching the pattern $3 == 2015

Multiple pattern / action pairs

Next up, we noticed that the data was broken down by males / females in the 4th field. Let’s pattern match and explore there! Remember, we can have as many patterns and actions as we want.

So let’s pattern match on female fist, and then male. Adding on to what we’ve already done, we’ll now have the pattern NR >= 498 && $3 == 2015 && $4 ~ /female/ and similar for male. We can double check that we have the same number of records for females and males before looking at aggregate data:

$ gawk 'BEGIN { FPAT = "[^,]*|\"[^\"]+\"" }
# Pattern match all country, 2015, female records and increase the count
NR >= 498 && $3 == 2015 && $4 ~ /female/ { total_female_records++ }
END { print total_female_records };
# Pattern match all country, 2015, male records and increase the count
NR >= 498 && $3 == 2015 && $4 ~ /male/ { total_male_records++ }
END { print total_male_records }' education.csv
423
846

We have double as many male records. Ah! "female" will also match /male/. Let’s be a bit more explicit when looking for /male/ by adding && $4 !~ /female/:

$ gawk 'BEGIN { FPAT = "[^,]*|\"[^\"]+\"" }
NR >= 498 && $3 == 2015 && $4 ~ /female/ { total_female_records++ }
END { print total_female_records };
NR >= 498 && $3 == 2015 && $4 ~ /male/ && $4 !~ /female/ { total_male_records++ }
END { print total_male_records }' education.csv
423
423

Much better. They have the same number of records. Out of curiosity, are there the same numbers of male and female students? Counts of students appear in the 5th column. Let’s sum up these counts for females and males:

$ gawk 'BEGIN { FPAT = "[^,]*|\"[^\"]+\"" }
# Instead of increasing the counter, add the number of
# total female students to total_females
NR >= 498 && $3 == 2015 && $4 ~ /female/ { total_females += $5 }
END { print total_females };
# Instead of increasing the counter, add the number of
# total male students to total_males
NR >= 498 && $3 == 2015 && $4 ~ /male/ && $4 !~ /female/ { total_males += $5 }
END { print total_males }' education.csv
34658.5
33686

Almost equal!

Awk string manipulation

The last item we had noticed from our data (waaaay) earlier was that there was different information about primary, secondary, and tertiary education. Taking a look at data in this column again:

$ gawk 'BEGIN { FPAT = "[^,]*|\"[^\"]+\"" }
NR >= 498 && $4 ~ /male/ && !seen[$4]++ { print $4 }' education.csv
Gross enrollement ratio - Primary (male)
Gross enrollment ratio - Primary (female)
Gross enrollment ratio - Secondary (male)
Gross enrollment ratio - Secondary (female)
Gross enrollment ratio - Tertiary (male)
Gross enrollment ratio - Tertiary (female)

The "Gross enrollment ratio - " isn’t really adding any information here since all of these values have it. Let’s get rid of it. Awk has many different string manipulation functions! We can use gsub to replace "Gross enrollment ratio - " with the empty string:

$ gawk 'BEGIN { FPAT = "[^,]*|\"[^\"]+\"" }
NR >= 498 && $4 ~ /male/ && !seen[$4]++
{ gsub("Gross enrollment ratio - ", "", $4); print $4 }' education.csv
Gross enrollement ratio - Primary (male)
Primary (female)
Secondary (male)
Secondary (female)
Tertiary (male)
Tertiary (female)

It took me longer than I care to admit to figure out what was happening on the first line. But, the data has a typo. The first line has an extra e in enrollment. No worries, we can use a little regex here enroll[e]?ment:

$ gawk 'BEGIN { FPAT = "[^,]*|\"[^\"]+\"" }
NR >= 498 && $4 ~ /male/ && !seen[$4]++
{ gsub("Gross enroll[e]?ment ratio - ", "", $4); print $4 }' education.csv
Primary (male)
Primary (female)
Secondary (male)
Secondary (female)
Tertiary (male)
Tertiary (female)

Nice, and summing up the students for each:

$ gawk 'BEGIN { FPAT = "[^,]*|\"[^\"]+\"" }
NR >= 498 && $4 ~ /male/ {
  gsub("Gross enroll[e]*ment ratio - ", "", $4);
  enrollments[$4] += $5
} END {
  for(key in enrollments) print key, enrollments[key]
}' education.csv
Tertiary (female) 30839.4
Primary (male) 105517
Secondary (male) 69915.7
Tertiary (male) 24960.2
Secondary (female) 69987.7
Primary (female) 102259

This output looks great, but it’s a little hard to read.

Sorting arrays

We can sort the output to make it easier to read. Arrays have two sorting functions: asort to sort by values and asorti to sort by indexes. We want to sort the data by indexes to make it more clear. What’s tricky then, is we still want to print the initial values once we’ve sorted the indexes. To do that, we’ll use enrollments[sorted_enrollment_indexes[i]]

$ gawk 'BEGIN { FPAT = "[^,]*|\"[^\"]+\"" }
NR >= 498 && $4 ~ /male/ { gsub("Gross enroll[e]*ment ratio - ", "", $0);
enrollments[$4] += $5 }
END { number_of_elements = asorti(enrollments, sorted_enrollment_indexes);
for(i = 1; i <= number_of_elements; i++)
print sorted_enrollment_indexes[i],
enrollments[sorted_enrollment_indexes[i]] }' education.csv
Primary (female) 102259
Primary (male) 105517
Secondary (female) 69987.7
Secondary (male) 69915.7
Tertiary (female) 30839.4
Tertiary (male) 24960.2

Neat! There are more males in primary education, it’s about even in secondary education, and then there are more females in tertiary education

More data exploration

I was curious about which countries had big discrepancies in males and females they educated, and so wrote a snippet to calculate this. We’ll use this snippet as a way to review some of what we’ve learned:

$ gawk 'BEGIN {
  # Our standard pattern matching for this file
  FPAT = "[^,]*|\"[^\"]+\"";

  ratio_threshold = 1.2;

  # Print out a header for our output
  print "Ratio of males to females educated above ratio threshold of", ratio_threshold
}

# Get data which applies to countries, in the year 2015, and is male/female
NR >= 498 && $3 == 2015 && $4 ~ /male/ {

  # $2 is the country column and $5 is the count column
  # This adds up sums per country of females and males
  # Awk has syntactic sugar for <condition> ? <true_case> : <false_case>
  ($4 ~ /female/) ? female_sums[$2] += $5 : male_sums[$2] += $5
} END {
  for (country in male_sums) {
    ratio = male_sums[country] / female_sums[country];

    # Check if either male:female or female:male are over the threshold
    if (ratio > ratio_threshold || 1 / ratio > ratio_threshold) {
      # This is tricky. Due to the way arrays sort in awk, we actually
      # want an array with indexes of ratio, and values of country so
      # we can sort by the ratios when we print later
      ratios[ratio] = country
    }
  };

  # Sort the array ratios by index
  # Put the sorted array into sorted_ratios
  asorti(ratios, sorted_ratios);
  for(ind in sorted_ratios) {
    ratio = sorted_ratios[ind]
    country = ratios[ratio];
    # Me being neurotic to get the printing to look nice
    (length(country) > 6) ? tabs = "\t" : tabs = "\t\t";
    print country, tabs, ratio
  }
}' education.csv
Ratio of males to females educated above ratio threshold of 1.2
Aruba		0.441774
Namibia		0.571526
Botswana	0.732033
Qatar		0.759564
Jamaica		0.829006
Benin		1.22078
Côte d’Ivoire	1.23023
Liechtenstein	1.23488
Niger		1.25275
Chad		1.42568
South Sudan	1.46199
Afghanistan	1.54712

This list did not match my expectations! But an important caveat is that not all countries had data for primary, secondary and tertiary education. If you’re curious about this, I would definitely encourage you to play with it some more to test our your new awk skills!

TL;DR

A quick review of the awk concepts we learned:

  • Arrays in awk have indexes and values
  • The pattern !seen[<column_index>]++ will get us unique rows by a column value
  • For loops have syntax for(i = 1; i < 10; i++) or for(record in records)
  • Awk can have distinct pattern / action pairs
  • If something matches multiple different patterns, it will execute multiple actions
  • Awk has many string manipulation functions, including gsub to search and replace
  • Sort arrays by value with asort(array, dest) and by index with asorti(array, dest)