Data Wrangling #
Lecture source: https://missing.csail.mit.edu/2020/data-wrangling/
?
after the quantifier in regular expression (regex): turn greedy matching to a non-greedy matching, see the following examples:
regex: b\w+?
text: b be bee beer beers
match: ^^ ^^ ^^ ^^
regex: b\w+?
text: b be bee beer beers
match: ^^ ^^^ ^^^^ ^^^^^
paste -sd,
: turn the column data to a single line (-s
) with delimeter ,
(-d,
) .
$ ls | paste -sd,
Applications,Code,Desktop,Documents,Downloads,Library,Movies,Music
bc
: Berkely calculator, it reads from the stdin and calculates.
$ echo "1 + 2" | bc -l
3
sort -nk1,1
: sort the keys (started from column 1, ended at column 1) numerically.
$ echo "2 1\n1 3" | sort -nk1,1
1 3
2 1
Exercises #
-
Take this short interactive regex tutorial.
Lessons:
-
Matching Characters:
abc.*
Matching Digits:
\d{3}
-
Matching With Wildcards:
\.
-
Matching Characters:
[cmf]an
-
Excluding Characters:
[^b]og
-
Matching Character Ranges:
[A-Z].*
-
Matching Repeated Characters:
waz{3,}up
-
Matching Repeated Characters:
aa+b*c+
-
Matching Optional Characters:
\d+ files? found\?
-
Matching Whitespaces:
\d+?.\s+abc
-
Matching Lines:
^Mission: successful$
-
Matching Groups:
^(file_.+)\.pdf$
-
Matching Nested Groups:
^(\w{3} (\d{4}))$
-
Matching Nested Groups:
(\d{4})x(\d+)
-
Matching Conditional Text:
I love (cat|dog)s
-
Matching Other Special Characters:
.*
Problems:
- Matching Numbers:
^-?[\d,]+\.?\d+(e\d+)?$
- Matching Phone Numbers:
(1\s)?\(?(\d{3})[-|)|\s]?\d{3}[-|)|\s]?\d{4}
- Matching Emails:
([\w\.]+)
- Capturing HTML Tags:
<(\w+)
- Capturing Filename Data:
(\w[\w_]*)\.(jpg|png|gif)$
- Matching Lines:
\s*(.*)\s*
- Extracting Data From Log Entries:
(\w+)\(([\w\.]*):(\d*)\)
- Extracting Data From URLs:
(\w+)://([\w_\-\.]+)(?:\:(\d+))?
-
-
Find the number of words (in
/usr/share/dict/words
) that contain at least threea
s and don’t have a's
ending. What are the three most common last two letters of those words?sed
’sy
command, or thetr
program, may help you with case insensitivity. How many of those two-letter combinations are there? And for a challenge: which combinations do not occur?
The number of valid words:
$ cat /usr/share/dict/words | sed -E 'y/A/a/' | grep -e "a.*a.*a" | grep -v -e ".*s'" | wc -l 7595
The three most common last two letters of those words:
$ cat /usr/share/dict/words | sed -E 'y/A/a/' | grep -e "a.*a.*a" | grep -v -e ".*s'" | sed -E 's/.*(..)/\1/' | sort | uniq -c | sort -nk1,1 | tail -n 3 763 an 814 ia 1039 al
Number of two-letter combinations:
$ cat /usr/share/dict/words | sed -E 'y/A/a/' | grep -e "a.*a.*a" | grep -v -e ".*s'" | sed -E 's/.*(..)/\1/' | sort | uniq | wc -l 156
The combinations do not occur:
# TODO
-
To do in-place substitution it is quite tempting to do something like
sed s/REGEX/SUBSTITUTION/ input.txt > input.txt
. However this is a bad idea, why? Is this particular tosed
? Useman sed
to find out how to accomplish this.
Use
-i
to do in-place substution insed
:-i extension Edit files in-place similarly to -I, but treat each file independently from other files. In particular, line numbers in each file start at 1, the “$” address matches the last line of the current file, and address ranges are limited to the current file. (See Sed Addresses.) The net result is as though each file were edited by a separate sed instance.
-
Find your average, median, and max system boot time over the last ten boots. Use
journalctl
on Linux and
log show
on macOS, and look for log timestamps near the beginning and end of each boot. On Linux, they may look something like:
Logs begin at ...
and
systemd[577]: Startup finished in ...
On macOS, look for:
=== system boot:
and
Previous shutdown cause: 5
⚠️TBD
Get macOS boot begining log:
log show --last 72h | grep "=== system boot:" |
Get macOS boot end log:
log show --last 72h | grep "Previous shutdown cause: 5"
If you are using R to get the statistics, then install R first.
$ brew install r
If you are using st to get the statistics, then install st first.
$ brew install st
If you want to get the statistics from a bunch of numbers, using
R --slave -e 'x <- scan(file="stdin", quiet=TRUE); summary(x)'
from the class orst --complete
instead. -
Look for boot messages that are not shared between your past three reboots (see
journalctl
’s-b
flag). Break this task down into multiple steps. First, find a way to get just the logs from the past three boots. There may be an applicable flag on the tool you use to extract the boot logs, or you can usesed '0,/STRING/d'
to remove all lines previous to one that matchesSTRING
. Next, remove any parts of the line that always varies (like the timestamp). Then, de-duplicate the input lines and keep a count of each one (uniq
is your friend). And finally, eliminate any line whose count is 3 (since it was shared among all the boots).
TODO
-
Find an online data set like this one, this one, or maybe one from here. Fetch it using
curl
and extract out just two columns of numerical data. If you’re fetching HTML data,pup
might be helpful. For JSON data, tryjq
. Find the min and max of one column in a single command, and the difference of the sum of each column in another.
TODO