Data Wrangling

# Data Wrangling #

Lecture source: https://missing.csail.mit.edu/2020/data-wrangling/

`?` after the quantifier in regular expression (regex): turn greedy matching to a non-greedy matching, see the following examples:

``````regex:  b\w+?
text:   b be bee beer beers
match:    ^^ ^^  ^^   ^^
``````
``````regex:  b\w+?
text:   b be bee beer beers
match:    ^^ ^^^ ^^^^ ^^^^^
``````

`paste -sd,`: turn the column data to a single line (`-s`) with delimeter `,` (`-d,`) .

``````\$ ls | paste -sd,
``````

`bc`: Berkely calculator, it reads from the stdin and calculates.

``````\$ echo "1 + 2" | bc -l
3
``````

`sort -nk1,1`: sort the keys (started from column 1, ended at column 1) numerically.

``````\$ echo "2 1\n1 3" | sort -nk1,1
1 3
2 1
``````

## Exercises #

1. Take this short interactive regex tutorial.

Lessons:

1. Matching Characters: `abc.*`

Matching Digits: `\d{3}`

2. Matching With Wildcards: `\.`

3. Matching Characters: `[cmf]an`

4. Excluding Characters: `[^b]og`

5. Matching Character Ranges: `[A-Z].*`

6. Matching Repeated Characters: `waz{3,}up`

7. Matching Repeated Characters: `aa+b*c+`

8. Matching Optional Characters: `\d+ files? found\?`

9. Matching Whitespaces: `\d+?.\s+abc`

10. Matching Lines: `^Mission: successful\$`

11. Matching Groups: `^(file_.+)\.pdf\$`

12. Matching Nested Groups: `^(\w{3} (\d{4}))\$`

13. Matching Nested Groups: `(\d{4})x(\d+)`

14. Matching Conditional Text: `I love (cat|dog)s`

15. Matching Other Special Characters: `.*`

Problems:

1. Matching Numbers: `^-?[\d,]+\.?\d+(e\d+)?\$`
2. Matching Phone Numbers: `(1\s)?\(?(\d{3})[-|)|\s]?\d{3}[-|)|\s]?\d{4}`
3. Matching Emails: `([\w\.]+)`
4. Capturing HTML Tags: `<(\w+)`
5. Capturing Filename Data: `(\w[\w_]*)\.(jpg|png|gif)\$`
6. Matching Lines: `\s*(.*)\s*`
7. Extracting Data From Log Entries: `(\w+)\(([\w\.]*):(\d*)\)`
8. Extracting Data From URLs: `(\w+)://([\w_\-\.]+)(?:\:(\d+))?`
2. Find the number of words (in `/usr/share/dict/words`) that contain at least three `a`s and don’t have a `'s` ending. What are the three most common last two letters of those words? `sed`’s `y` command, or the `tr` program, may help you with case insensitivity. How many of those two-letter combinations are there? And for a challenge: which combinations do not occur?

The number of valid words:

``````\$ cat /usr/share/dict/words | sed -E 'y/A/a/' | grep -e "a.*a.*a" | grep -v -e ".*s'" | wc -l
7595
``````

The three most common last two letters of those words:

``````\$ cat /usr/share/dict/words | sed -E 'y/A/a/' | grep -e "a.*a.*a" | grep -v -e ".*s'" | sed -E 's/.*(..)/\1/' | sort | uniq -c | sort -nk1,1 | tail -n 3
763 an
814 ia
1039 al
``````

Number of two-letter combinations:

``````\$ cat /usr/share/dict/words | sed -E 'y/A/a/' | grep -e "a.*a.*a" | grep -v -e ".*s'" | sed -E 's/.*(..)/\1/' | sort | uniq | wc -l
156
``````

The combinations do not occur:

``````# TODO
``````
3. To do in-place substitution it is quite tempting to do something like `sed s/REGEX/SUBSTITUTION/ input.txt > input.txt`. However this is a bad idea, why? Is this particular to `sed`? Use `man sed` to find out how to accomplish this.

Use `-i` to do in-place substution in `sed`:

``````-i extension
Edit files in-place similarly to -I, but treat each file independently from
other files.  In particular, line numbers in each file start at 1, the “\$”
address matches the last line of the current file, and address ranges are
limited to the current file.  (See Sed Addresses.) The net result is as
though each file were edited by a separate sed instance.
``````
4. Find your average, median, and max system boot time over the last ten boots. Use

``````journalctl
``````

on Linux and

``````log show
``````

on macOS, and look for log timestamps near the beginning and end of each boot. On Linux, they may look something like:

``````Logs begin at ...
``````

and

``````systemd: Startup finished in ...
``````

On macOS, look for:

``````=== system boot:
``````

and

``````Previous shutdown cause: 5
``````

⚠️TBD

Get macOS boot begining log:

``````log show  --last 72h | grep "=== system boot:" |
``````

Get macOS boot end log:

``````log show  --last 72h | grep "Previous shutdown cause: 5"
``````

If you are using R to get the statistics, then install R first.

``````\$ brew install r
``````

If you are using st to get the statistics, then install st first.

``````\$ brew install st
``````

If you want to get the statistics from a bunch of numbers, using `R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'` from the class or `st --complete` instead.

5. Look for boot messages that are not shared between your past three reboots (see `journalctl`’s `-b` flag). Break this task down into multiple steps. First, find a way to get just the logs from the past three boots. There may be an applicable flag on the tool you use to extract the boot logs, or you can use `sed '0,/STRING/d'` to remove all lines previous to one that matches `STRING`. Next, remove any parts of the line that always varies (like the timestamp). Then, de-duplicate the input lines and keep a count of each one (`uniq` is your friend). And finally, eliminate any line whose count is 3 (since it was shared among all the boots).

TODO

6. Find an online data set like this one, this one, or maybe one from here. Fetch it using `curl` and extract out just two columns of numerical data. If you’re fetching HTML data, `pup` might be helpful. For JSON data, try `jq`. Find the min and max of one column in a single command, and the difference of the sum of each column in another.

TODO