Data Wrangling

Data Wrangling #

Lecture source: https://missing.csail.mit.edu/2020/data-wrangling/

? after the quantifier in regular expression (regex): turn greedy matching to a non-greedy matching, see the following examples:

regex:  b\w+?
text:   b be bee beer beers
match:    ^^ ^^  ^^   ^^
regex:  b\w+?
text:   b be bee beer beers
match:    ^^ ^^^ ^^^^ ^^^^^

paste -sd,: turn the column data to a single line (-s) with delimeter , (-d,) .

$ ls | paste -sd,
Applications,Code,Desktop,Documents,Downloads,Library,Movies,Music

bc: Berkely calculator, it reads from the stdin and calculates.

$ echo "1 + 2" | bc -l
3

sort -nk1,1: sort the keys (started from column 1, ended at column 1) numerically.

$ echo "2 1\n1 3" | sort -nk1,1
1 3
2 1

Exercises #

  1. Take this short interactive regex tutorial.


    Lessons:

    1. Matching Characters: abc.*

      Matching Digits: \d{3}

    2. Matching With Wildcards: \.

    3. Matching Characters: [cmf]an

    4. Excluding Characters: [^b]og

    5. Matching Character Ranges: [A-Z].*

    6. Matching Repeated Characters: waz{3,}up

    7. Matching Repeated Characters: aa+b*c+

    8. Matching Optional Characters: \d+ files? found\?

    9. Matching Whitespaces: \d+?.\s+abc

    10. Matching Lines: ^Mission: successful$

    11. Matching Groups: ^(file_.+)\.pdf$

    12. Matching Nested Groups: ^(\w{3} (\d{4}))$

    13. Matching Nested Groups: (\d{4})x(\d+)

    14. Matching Conditional Text: I love (cat|dog)s

    15. Matching Other Special Characters: .*

    Problems:

    1. Matching Numbers: ^-?[\d,]+\.?\d+(e\d+)?$
    2. Matching Phone Numbers: (1\s)?\(?(\d{3})[-|)|\s]?\d{3}[-|)|\s]?\d{4}
    3. Matching Emails: ([\w\.]+)
    4. Capturing HTML Tags: <(\w+)
    5. Capturing Filename Data: (\w[\w_]*)\.(jpg|png|gif)$
    6. Matching Lines: \s*(.*)\s*
    7. Extracting Data From Log Entries: (\w+)\(([\w\.]*):(\d*)\)
    8. Extracting Data From URLs: (\w+)://([\w_\-\.]+)(?:\:(\d+))?
  2. Find the number of words (in /usr/share/dict/words) that contain at least three as and don’t have a 's ending. What are the three most common last two letters of those words? sed’s y command, or the tr program, may help you with case insensitivity. How many of those two-letter combinations are there? And for a challenge: which combinations do not occur?


    The number of valid words:

    $ cat /usr/share/dict/words | sed -E 'y/A/a/' | grep -e "a.*a.*a" | grep -v -e ".*s'" | wc -l
        7595
    

    The three most common last two letters of those words:

    $ cat /usr/share/dict/words | sed -E 'y/A/a/' | grep -e "a.*a.*a" | grep -v -e ".*s'" | sed -E 's/.*(..)/\1/' | sort | uniq -c | sort -nk1,1 | tail -n 3
     763 an
     814 ia
    1039 al
    

    Number of two-letter combinations:

    $ cat /usr/share/dict/words | sed -E 'y/A/a/' | grep -e "a.*a.*a" | grep -v -e ".*s'" | sed -E 's/.*(..)/\1/' | sort | uniq | wc -l
         156
    

    The combinations do not occur:

    # TODO
    
  3. To do in-place substitution it is quite tempting to do something like sed s/REGEX/SUBSTITUTION/ input.txt > input.txt. However this is a bad idea, why? Is this particular to sed? Use man sed to find out how to accomplish this.


    Use -i to do in-place substution in sed:

    -i extension
            Edit files in-place similarly to -I, but treat each file independently from
            other files.  In particular, line numbers in each file start at 1, the “$”
            address matches the last line of the current file, and address ranges are
            limited to the current file.  (See Sed Addresses.) The net result is as
            though each file were edited by a separate sed instance.
    
  4. Find your average, median, and max system boot time over the last ten boots. Use

    journalctl
    

    on Linux and

    log show 
    

    on macOS, and look for log timestamps near the beginning and end of each boot. On Linux, they may look something like:

    Logs begin at ...
    

    and

    systemd[577]: Startup finished in ...
    

    On macOS, look for:

    === system boot:
    

    and

    Previous shutdown cause: 5
    

    ⚠️TBD

    Get macOS boot begining log:

    log show  --last 72h | grep "=== system boot:" | 
    

    Get macOS boot end log:

    log show  --last 72h | grep "Previous shutdown cause: 5"
    

    If you are using R to get the statistics, then install R first.

    $ brew install r
    

    If you are using st to get the statistics, then install st first.

    $ brew install st
    

    If you want to get the statistics from a bunch of numbers, using R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)' from the class or st --complete instead.

  5. Look for boot messages that are not shared between your past three reboots (see journalctl’s -b flag). Break this task down into multiple steps. First, find a way to get just the logs from the past three boots. There may be an applicable flag on the tool you use to extract the boot logs, or you can use sed '0,/STRING/d' to remove all lines previous to one that matches STRING. Next, remove any parts of the line that always varies (like the timestamp). Then, de-duplicate the input lines and keep a count of each one (uniq is your friend). And finally, eliminate any line whose count is 3 (since it was shared among all the boots).


    TODO

  6. Find an online data set like this one, this one, or maybe one from here. Fetch it using curl and extract out just two columns of numerical data. If you’re fetching HTML data, pup might be helpful. For JSON data, try jq. Find the min and max of one column in a single command, and the difference of the sum of each column in another.


    TODO