Splitting a file on lines matching a pattern
Imagine you want to split a file on lines matching a pattern, such as empty
lines or the section headings of a markdown file (## Section).
How do you do that on the command line?
1. Unix/Linux split
split, like most Unix/Linux tools, works line by line, but it can only cut a
file into pieces of a fixed number of lines or bytes.
It cannot split on a pattern, so it is not usable here.
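To see why, here is a small sketch (the names input.txt and chunk_ are made up for the example): split cuts every two lines, wherever the empty line happens to be.

```shell
# Work in a scratch directory so nothing gets overwritten.
cd "$(mktemp -d)"
printf 'a\nb\n\nc\nd\ne\n' > input.txt

# Cut into pieces of 2 lines each; the content of the lines is never inspected.
split -l 2 input.txt chunk_

# List the pieces: chunk_aa, chunk_ab and chunk_ac
ls chunk_*
```

The empty line ends up at the top of chunk_ab instead of acting as a boundary.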
2. csplit
GNU coreutils comes with csplit for this purpose, although its syntax is
peculiar:
csplit --quiet --elide-empty-files --suppress-matched <input_file> '/^$/' '{*}'
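This will split input_file on empty lines (/^$/), as many times as possible ({*}),
and save the outputs to files named xx00, xx01, etc.
The matched lines themselves are dropped (--suppress-matched) and empty pieces are not written (--elide-empty-files).
It’s a terminal operation, as in "cannot be piped to something else".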
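As a quick check, assuming GNU csplit, here is the same command on a tiny input. The file input.txt and the part_ prefix (set with --prefix) are only there for illustration.

```shell
# Work in a scratch directory so nothing gets overwritten.
cd "$(mktemp -d)"
printf 'a\nb\n\nc\n\nd\n' > input.txt

# Same options as above, plus a custom prefix instead of the default xx.
csplit --quiet --elide-empty-files --suppress-matched --prefix=part_ input.txt '/^$/' '{*}'

# Three pieces are created: part_00 (a, b), part_01 (c) and part_02 (d).
head part_*
```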
3. awk
Similarly, we can use awk to do the same:
awk 'BEGIN {x="xx0"} /^$/{x="xx"++i;} {print > x}' <input_file>
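This is:
- creating a variable x set to xx0 at the very beginning
- updating it to xx1, xx2, etc. each time the pattern is matched (/^$/)
- in all cases, redirecting the output of print to the file named by the variable x.
Unlike the csplit command above, the matched line is kept: it becomes the first line of the new file.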
This operation is also terminal.
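If the separator lines should be dropped, as csplit does with --suppress-matched, one possible variant (a sketch, not the only way to do it) skips them with next. It also closes each finished file, so that inputs with many sections do not run out of file descriptors. The file input.txt is made up for the example.

```shell
# Work in a scratch directory so nothing gets overwritten.
cd "$(mktemp -d)"
printf 'a\nb\n\nc\n' > input.txt

# On a separator: close the current file, move to the next name, skip the line.
# Any other line is appended to the current file.
awk 'BEGIN { x = "xx0" }
     /^$/  { close(x); x = "xx" (++i); next }
     { print > x }' input.txt

# xx0 holds a and b, xx1 holds c.
head xx*
```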
4. jq
Doing it with jq gives us the output in a structured format, so
we can pipe it to another command:
jq -nR --stream '[inputs] | reduce .[] as $item ([[]]; if $item | test("^$") then . += [[]] else .[-1] += [$item] end)' <input_file>
This reads the input as a list of lines and splits it, returning a list of lists of lines.
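As an illustration of the piping, here is a sketch that feeds the result to a second jq, which counts the lines of each chunk. The file names input.txt and counts.json are made up, and --stream is left out on the assumption that it makes no difference for raw input (-R).

```shell
# Work in a scratch directory so nothing gets overwritten.
cd "$(mktemp -d)"
printf 'a\nb\n\nc\n' > input.txt

# First jq: split the lines into chunks. Second jq: count the lines of each chunk.
jq -nR '[inputs] | reduce .[] as $item ([[]]; if $item | test("^$") then . += [[]] else .[-1] += [$item] end)' input.txt \
  | jq -c 'map(length)' > counts.json

# prints [2,1]
cat counts.json
```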
5. Java
For fun, we can try to implement it in Java (11+):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.regex.Pattern;

public class SplitBySeparators {
    public static void main(final String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: SplitBySeparators <filename> <separator>");
            System.exit(1);
        }
        final String filename = args[0];
        final Pattern pattern = Pattern.compile(args[1]);
        int counter = -1;
        // readAllLines works on Java 11, unlike Stream.toList() (Java 16+).
        for (final String line : Files.readAllLines(Paths.get(filename))) {
            // Start a new output file on the first line and on every separator line.
            final boolean startNewFile = counter == -1 || pattern.matcher(line).matches();
            if (startNewFile) {
                counter += 1;
            }
            // The separator line itself becomes the first line of the new file.
            final String lineToAdd = startNewFile ? line : "\n" + line;
            // Truncate when starting a file, so a rerun does not keep stale content.
            Files.write(Paths.get(filename + counter),
                    lineToAdd.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE,
                    startNewFile ? StandardOpenOption.TRUNCATE_EXISTING : StandardOpenOption.APPEND);
        }
    }
}
Compile it once:
javac SplitBySeparators.java
And then use it:
java SplitBySeparators <input_file> '^$'
Optionally, you can even compile it to a native binary with GraalVM's native-image:
native-image --no-server --static SplitBySeparators SplitBySeparators
and use it:
./SplitBySeparators <input_file> '^$'