Splitting a file on lines matching a pattern
Imagine you want to split a file on lines matching a pattern, for example on
empty lines, or on the section headings of a Markdown file (## Section).
How do you do that on the command line?
1. Unix/Linux split
split, just like most Unix/Linux tools, works on a per-line basis, but it can
only cut after a fixed number of lines or bytes, not on lines matching a pattern.
Therefore, it is not usable here.
2. csplit
GNU coreutils comes with csplit for this purpose, although its syntax is
peculiar:
csplit --quiet --elide-empty-files --suppress-matched <input_file> '/^$/' '{*}'
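This will split input_file on empty lines (/^$/), as many times as possible ({*}),
drop the matching lines themselves (--suppress-matched), skip any output that
would be empty (--elide-empty-files), and save the outputs to files named
xx00, xx01, etc.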
It’s a terminal operation, as in "cannot be piped to something else".
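To make the behaviour concrete, here is a small worked run. It is only a sketch: it assumes GNU csplit from a reasonably recent coreutils (one that knows --suppress-matched), and sample.txt is a made-up input file created in a scratch directory.

```shell
# Work in a scratch directory so the xx* files do not clutter anything
cd "$(mktemp -d)"

# Hypothetical input: three blocks separated by empty lines
printf 'alpha\nbeta\n\ngamma\n\ndelta\nepsilon\n' > sample.txt

csplit --quiet --elide-empty-files --suppress-matched sample.txt '/^$/' '{*}'

ls          # sample.txt, xx00, xx01, xx02
cat xx01    # prints "gamma"
```

Each block ends up in its own file, and the empty separator lines are gone.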
3. awk
Similarly, we can use awk to do the same:
awk 'BEGIN {x="xx0"} /^$/{x="xx"++i;} {print > x}' <input_file>
This is:
- creating a variable x set to xx0 at the very beginning
- updating it to xx1, xx2, etc. each time the pattern is matched (/^$/)
- in all cases, redirecting the output of print to the file named by the variable x.
This operation is also terminal.
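The same kind of worked run helps here too. This is a sketch on a made-up sample.txt in a scratch directory; it runs the command above unchanged and looks at what lands in each file.

```shell
# Work in a scratch directory so the xx* files do not clutter anything
cd "$(mktemp -d)"

# Hypothetical input: three blocks separated by empty lines
printf 'alpha\nbeta\n\ngamma\n\ndelta\n' > sample.txt

awk 'BEGIN {x="xx0"} /^$/{x="xx"++i;} {print > x}' sample.txt

cat xx0     # alpha, beta
cat xx1     # an empty line, then gamma
cat xx2     # an empty line, then delta
```

Note the difference with csplit --suppress-matched: the matching line is printed too, so every file after the first starts with the separator. If that is not wanted, one option is to add next to the pattern action (/^$/{x="xx"++i; next}) so that the matching line is skipped.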
4. jq
Doing it with jq gives us the output in a structured format (JSON), so
we can pipe it to another command:
jq -nR --stream '[inputs] | reduce .[] as $item ([[]]; if $item | test("^$") then . += [[]] else .[-1] += [$item] end)' <input_file>
This reads the input as a list of lines and splits it, returning a list of lists of lines.
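As an illustration of the piping, the sketch below feeds a made-up four-line input to the same filter on stdin, keeps the JSON in a shell variable (chunks, a name chosen for this example), and passes it to a second jq that counts the lines of each chunk. It assumes jq is installed.

```shell
# Hypothetical input: two blocks separated by one empty line
chunks=$(printf 'alpha\nbeta\n\ngamma\n' |
  jq -nR --stream '[inputs] | reduce .[] as $item ([[]]; if $item | test("^$") then . += [[]] else .[-1] += [$item] end)')

# The result is plain JSON, so any further command can consume it
printf '%s\n' "$chunks" | jq -c 'map(length)'
# prints [2,1]
```

Here the separator lines are dropped, since a matching line only opens a new empty sub-list.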
5. Java
For fun, we can try to implement it in Java (11+):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.regex.Pattern;

public class SplitBySeparators {
    public static void main(final String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: SplitBySeparators <filename> <separator>");
            System.exit(1);
        }
        final String filename = args[0];
        final Pattern pattern = Pattern.compile(args[1]);
        int counter = -1;
        // readAllLines is used because Stream.toList() only exists since Java 16
        for (final String line : Files.readAllLines(Paths.get(filename))) {
            // By default, append the line to the current output file
            var options = new StandardOpenOption[] {StandardOpenOption.APPEND};
            String lineToAdd = "\n" + line;
            // Start a new output file on the first line and on every separator line
            if (pattern.matcher(line).matches() || counter == -1) {
                counter += 1;
                // Truncate so that a re-run does not leave stale bytes from a previous run
                options = new StandardOpenOption[] {
                    StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING};
                lineToAdd = line;
            }
            Files.write(Paths.get(filename + counter),
                    lineToAdd.getBytes(StandardCharsets.UTF_8),
                    options);
        }
    }
}
Compile it once:
javac SplitBySeparators.java
And then use it (the chunks are written next to the input, as <input_file>0, <input_file>1, etc.):
java SplitBySeparators <input_file> '^$'
Optionally, you can even compile it to a native binary:
native-image --no-server --static SplitBySeparators SplitBySeparators
and use it:
./SplitBySeparators <input_file> '^$'