Monday, July 23, 2018

csplit: A Better Way to Split Files in Linux Based on Their Content

https://linuxhandbook.com/csplit-command

When it comes to splitting a text file into multiple files in Linux, most people use the split command. There is nothing wrong with the split command, except that it relies on byte counts or line counts to split files.
This is not convenient in situations where you need to split a file based on its content instead of its size. Let me give you an example.
I manage my scheduled tweets using YAML files. A typical tweet file contains several tweets, separated by four dashes:
  ----
    event:
      repeat: { days: 180 }
    status: |
      I think I use the `sed` command daily. And you?

      https://www.yesik.it/EP07
      #Shell #Linux #Sed #YesIKnowIT
  ----
    status: |
      Print the first column of a space-separated data file:
      awk '{print $1}' data.txt # Print out just the first column

      For some unknown reason, I find that easier to remember than:
      cut -f1 data.txt

      #Linux #AWK #Cut
  ----
    status: |
      For the #shell #beginners :
[...]
When importing them into my system, I need to write each tweet to its own file. I do that to avoid registering duplicate tweets.
But how do you split a file into several parts based on its content? Well, you can probably obtain something workable using an awk command:
  sh$ awk < tweets.yaml '
  >     /----/ { OUTPUT="tweet." (N++) ".yaml" }
  >     { print > OUTPUT }
  > '
However, despite its relative simplicity, such a solution is not very robust: for example, I didn't properly close the various output files, so this might very well hit the open-files limit. Or what if I forgot the separator before the very first tweet of the file? Of course, all that can be handled and fixed in the AWK script, at the expense of making it more complex (see the sketch below). But why bother with that when we have the csplit tool to accomplish that task?
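Just to illustrate what that "more complex" AWK script could look like, here is a rough sketch; the zero-padded file names are my own choice for this example and simply mirror what csplit will produce below:
  sh$ awk < tweets.yaml '
  >     /----/ {                        # a separator line starts a new chunk
  >         if (OUTPUT) close(OUTPUT)   # close the previous chunk to stay below the open-files limit
  >         OUTPUT = sprintf("tweet.%03d.yaml", N++)
  >     }
  >     !OUTPUT { OUTPUT = sprintf("tweet.%03d.yaml", N++) }  # no separator before the first tweet? start a chunk anyway
  >     { print > OUTPUT }
  > '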

Using csplit to split files in Linux

The csplit tool is a cousin of the split tool, which splits a file into fixed-size chunks. But csplit identifies the chunk boundaries based on the file content rather than on a byte or line count.
In this tutorial, I'll demonstrate how to use the csplit command and explain its output.
So, for example, if I want to split my tweet file based on the ---- delimiter, I could write:
  sh$ csplit tweets.yaml /----/
  0
  10846
You may have guessed that the csplit tool used the regex provided on the command line to identify the separator. And what are those 0 and 10846 values displayed on the standard output? Well, they are the sizes, in bytes, of each created chunk of data.
  sh$ ls -l xx0*
  -rw-r--r-- 1 sylvain sylvain     0 Jun  6 11:30 xx00
  -rw-r--r-- 1 sylvain sylvain 10846 Jun  6 11:30 xx01
Wait a minute! Where are those xx00 and xx01 filenames coming from? Why did csplit split the file into only two chunks? And why does the first data chunk have a length of zero bytes?
The answer to the first question is simple: xxNN (or more formally xx%02d) is the default filename format used by csplit. But you can change that using the --suffix-format and --prefix options. For example, I could change the format to something more meaningful for my needs:
  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     /----/
  0
  10846

  sh$ ls -l tweet.*
  -rw-r--r-- 1 sylvain sylvain     0 Jun  6 11:30 tweet.000.yaml
  -rw-r--r-- 1 sylvain sylvain 10846 Jun  6 11:30 tweet.001.yaml
The prefix is a plain string, but the suffix is a format string like the one used by the standard C library printf function. Most characters of the format will be used verbatim, except for conversion specifications, which are introduced by the percent sign (%) and end with a conversion specifier (here, d). In between, the format may also contain various flags and options. In my example, the %03d conversion specification means:
  • display the chunk number as a decimal integer (d),
  • in a three characters width field (3),
  • padded on the left with zeros when needed (0).
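If you want to see the effect of that conversion specification in isolation, the shell's printf command understands the same syntax (just a quick illustration, unrelated to csplit itself):
  sh$ printf '%03d\n' 7
  007
  sh$ printf '%03d\n' 42
  042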
But that does not answer the other questions I had above: why do we have only two chunks, one of them containing zero bytes? Maybe you have already found the answer to the latter question by yourself: my data file starts with ---- on its very first line. So csplit considered it a delimiter, and since there was no data before that line, it created an empty first chunk. We can disable the creation of zero-byte files using the --elide-empty-files option:
  sh$ rm tweet.*
  rm: cannot remove 'tweet.*': No such file or directory
  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     --elide-empty-files \
  >     /----/
  10846

  sh$ ls -l tweet.*
  -rw-r--r-- 1 sylvain sylvain 10846 Jun  6 11:30 tweet.000.yaml
OK: no more empty files. But in a sense, the result is worse now, since csplit split the file into just one chunk. We can hardly call that "splitting" a file, can we?
The explanation for that surprising result is that csplit does not assume every chunk is delimited by the same separator. Actually, csplit requires you to provide each separator explicitly, even if it is the same one repeated several times:
  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     --elide-empty-files \
  >     /----/ /----/ /----/
  170
  250
  10426
I've put three (identical) separators on the command line. So csplit identified the end of the first chunk based on the first separator. That produced a zero-byte chunk, which was elided. The second chunk was delimited by the next line matching /----/, resulting in a 170-byte chunk. Finally, a third, 250-byte chunk was identified based on the third separator. The remaining data, 10426 bytes, went into the last chunk.
  sh$ ls -l tweet.???.yaml
  -rw-r--r-- 1 sylvain sylvain   170 Jun  6 11:30 tweet.000.yaml
  -rw-r--r-- 1 sylvain sylvain   250 Jun  6 11:30 tweet.001.yaml
  -rw-r--r-- 1 sylvain sylvain 10426 Jun  6 11:30 tweet.002.yaml
Obviously, it wouldn't be practical if we had to provide as many separators on the command line as there are chunks in the data file, especially since that exact number is usually not known in advance. Fortunately, csplit has a special pattern meaning "repeat the previous pattern as many times as possible." Although its syntax is reminiscent of the star quantifier in a regular expression, it is closer to the Kleene plus, since it repeats a separator that has already been matched once:
  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     --elide-empty-files \
  >     /----/ '{*}'
  170
  250
  190
  208
  140
[...]
  247
  285
  194
  214
  185
  131
  316
  221
And this time, finally, I have split my tweet collection into individual parts. However, does csplit have some other nice "special" patterns like that? Well, I don't know if we can call them "special", but csplit definitely understands more patterns.

More csplit patterns

We've just seen in the preceding section how to use the '{*}' quantifier for unbounded repetition. However, by replacing the star with a number, you can request an exact number of repetitions:
  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     --elide-empty-files \
  >     /----/ '{6}'
  170
  250
  190
  208
  140
  216
  9672
That leads to an interesting corner case. What would happen if the number of repetitions exceeded the number of actual delimiters in the data file? Well, let's see with an example:
  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     --elide-empty-files \
  >     /----/ '{999}'
  csplit: ‘/----/’: match not found on repetition 62
  170
  250
  190
  208
[...]
  91
  247
  285
  194
  214
  185
  131
  316
  221

  sh$ ls tweet.*
  ls: cannot access 'tweet.*': No such file or directory
Interestingly, not only did csplit report an error, it also removed all the chunk files created during the process. Pay special attention to my wording: it removed them. That means the files were created, and then, when csplit encountered the error, it deleted them. In other words, if you already have a file whose name looks like a chunk file, it will be removed:
  sh$ touch tweet.002.yaml
  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     --elide-empty-files \
  >     /----/ '{999}'
  csplit: ‘/----/’: match not found on repetition 62
  170
  250
  190
[...]
  87
  91
  247
  285
  194
  214
  185
  131
  316
  221

  sh$ ls tweet.*
  ls: cannot access 'tweet.*': No such file or directory
In the above example, the tweet.002.yaml file we manually created was overwritten and then removed by csplit.
You can change that behavior using the --keep-files option. As its name implies, with that option csplit will not remove the chunks it has already created when it encounters an error:
  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     --elide-empty-files \
  >     --keep-files \
  >     /----/ '{999}'
  csplit: ‘/----/’: match not found on repetition 62
  170
  250
  190
[...]
  316
  221

  sh$ ls tweet.*
  tweet.000.yaml
  tweet.001.yaml
  tweet.002.yaml
  tweet.003.yaml
[...]
  tweet.058.yaml
  tweet.059.yaml
  tweet.060.yaml
  tweet.061.yaml
Notice that in this case, despite the error, csplit didn't discard any data:
  sh$ diff -s tweets.yaml <(cat tweet.*)
  Files tweets.yaml and /dev/fd/63 are identical
But what if there is some data in the file I want to discard? Well, csplit has some limited support for that through the %regex% pattern.

Skipping data in csplit

When using a percent sign (%) as the regex delimiter instead of a slash (/), csplit will skip data up to (but not including) the first line matching the regular expression. This may be useful to ignore some records, especially at the start or the end of the input file:
  sh$ # Keep only the first two tweets
  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     --elide-empty-files \
  >     --keep-files \
  >     /----/ '{2}' %----% '{*}'
  170
  250

  sh$ head tweet.00[012].yaml
  ==> tweet.000.yaml <==
  ----
    event:
      repeat: { days: 180 }
    status: |
      I think I use the `sed` command daily. And you?

      https://www.yesik.it/EP07
      #Shell #Linux #Sed #YesIKnowIT

  ==> tweet.001.yaml <==
  ----
    status: |
      Print the first column of a space-separated data file:
      awk '{print $1}' data.txt # Print out just the first column

      For some unknown reason, I find that easier to remember than:
      cut -f1 data.txt

      #Linux #AWK #Cut
  sh$ # Skip the first two tweets
  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     --elide-empty-files \
  >     --keep-files \
  >     %----% '{2}' /----/ '{2}'
  190
  208
  140
  9888

  sh$ head tweet.00[012].yaml
  ==> tweet.000.yaml <==
  ----
    status: |
      For the #shell #beginners :
      « #GlobPatterns : how to move hundreds of files in not time [1/3] »
      https://youtu.be/TvW8DiEmTcQ

      #Unix #Linux
      #YesIKnowIT

  ==> tweet.001.yaml <==
  ----
    status: |
      Want to know the oldest file in your disk?

      find / -type f -printf '%TFT%.8TT %p\n' | sort | less
      (should work on any Single UNIX Specification compliant system)
      #UNIX #Linux

  ==> tweet.002.yaml <==
  ----
    status: |
      When using the find command, use `-iname` instead of `-name` for case-insensitive search
      #Unix #Linux #Shell #Find
  sh$ # Keep only the third and fourth tweets
  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     --elide-empty-files \
  >     --keep-files \
  >     %----% '{2}' /----/ '{2}' %----% '{*}'
  190
  208
  140

  sh$ head tweet.00[012].yaml
  ==> tweet.000.yaml <==
  ----
    status: |
      For the #shell #beginners :
      « #GlobPatterns : how to move hundreds of files in not time [1/3] »
      https://youtu.be/TvW8DiEmTcQ

      #Unix #Linux
      #YesIKnowIT

  ==> tweet.001.yaml <==
  ----
    status: |
      Want to know the oldest file in your disk?

      find / -type f -printf '%TFT%.8TT %p\n' | sort | less
      (should work on any Single UNIX Specification compliant system)
      #UNIX #Linux

  ==> tweet.002.yaml <==
  ----
    status: |
      When using the find command, use `-iname` instead of `-name` for case-insensitive search
      #Unix #Linux #Shell #Find

Using offsets while splitting files with csplit

When using regular expressions (either /…/ or %…%), you can specify a positive (+N) or negative (-N) offset at the end of the pattern so csplit will split the file N lines after or before the matching line. Remember, in all cases, the pattern specifies the end of the chunk:
  sh$ csplit tweets.yaml \
  >     --prefix='tweet.' --suffix-format='%03d.yaml' \
  >     --elide-empty-files \
  >     --keep-files \
  >     %----%+1 '{2}' /----/+1 '{2}' %----% '{*}'
  190
  208
  140

  sh$ head tweet.00[012].yaml
  ==> tweet.000.yaml <==
    status: |
      For the #shell #beginners :
      « #GlobPatterns : how to move hundreds of files in not time [1/3] »
      https://youtu.be/TvW8DiEmTcQ

      #Unix #Linux
      #YesIKnowIT
  ----

  ==> tweet.001.yaml <==
    status: |
      Want to know the oldest file in your disk?

      find / -type f -printf '%TFT%.8TT %p\n' | sort | less
      (should work on any Single UNIX Specification compliant system)
      #UNIX #Linux
  ----

  ==> tweet.002.yaml <==
    status: |
      When using the find command, use `-iname` instead of `-name` for case-insensitive search
      #Unix #Linux #Shell #Find
  ----

Split by line number

We have already seen how to use a regular expression to split files. In that case, csplit splits the file at the first line matching that regex. But you can also identify the split line by its line number, as we will see now.
Before switching to YAML, I used to store my scheduled tweets in a flat file.
In that file, a tweet was made of two lines: the first containing an optional repetition rule, and the second containing the text of the tweet, with newlines replaced by \n. Once again, that sample file is available online.
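To give you a better idea, here is roughly how that flat file starts, reconstructed from the chunks displayed below. Note that the file actually begins with a blank line (not reproduced here), something that will matter in a minute:
  { days:180 }
  I think I use the `sed` command daily. And you?\n\nhttps://www.yesik.it/EP07\n#Shell #Linux #Sed\n#YesIKnowIT
  {}
  Print the first column of a space-separated data file:\nawk '{print $1}' data.txt # Print out just the first column\n\nFor some unknown reason, I find that easier to remember than:\ncut -f1 data.txt\n\n#Linux #AWK #Cut
  [...]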
With that "fixed-size" format, I was also able to use csplit to put each individual tweet into its own file:
  sh$ csplit tweets.txt \
  >     --prefix='tweet.' --suffix-format='%03d.txt' \
  >     --elide-empty-files \
  >     --keep-files \
  >     2 '{*}'
  csplit: ‘2’: line number out of range on repetition 62
  1
  123
  222
  161
  182
  119
  184
  81
  148
  128
  142
  101
  107
[...]
  sh$ diff -s tweets.txt <(cat tweet.*.txt)
  Files tweets.txt and /dev/fd/63 are identical
  sh$ head tweet.00[012].txt
  ==> tweet.000.txt <==


  ==> tweet.001.txt <==
  { days:180 }
  I think I use the `sed` command daily. And you?\n\nhttps://www.yesik.it/EP07\n#Shell #Linux #Sed\n#YesIKnowIT

  ==> tweet.002.txt <==
  {}
  Print the first column of a space-separated data file:\nawk '{print $1}' data.txt # Print out just the first column\n\nFor some unknown reason, I find that easier to remember than:\ncut -f1 data.txt\n\n#Linux #AWK #Cut
The example above seems easy to understand, but there are two pitfalls here. First, the 2 given as an argument to csplit is a line number, not a line count. However, when using a repetition as I did, after the first match csplit will use that number as a line count. If that's not clear, compare the output of the following three commands:
  sh$ csplit tweets.txt --keep-files 2 2 2 2 2
  csplit: warning: line number ‘2’ is the same as preceding line number
  csplit: warning: line number ‘2’ is the same as preceding line number
  csplit: warning: line number ‘2’ is the same as preceding line number
  csplit: warning: line number ‘2’ is the same as preceding line number
  1
  0
  0
  0
  0
  9030
  sh$ csplit tweets.txt --keep-files 2 4 6 8 10
  1
  123
  222
  161
  182
  8342
  sh$ csplit tweets.txt --keep-files 2 '{4}'
  1
  123
  222
  161
  182
  8342
I mentioned a second pitfall, somewhat related to the first one. Did you notice the empty line at the very top of the tweets.txt file? It is responsible for that tweet.000.txt chunk containing only a newline character. Unfortunately, that empty line was required in this example because of the repetition: remember, I want two-line chunks, so the 2 is mandatory before the repetition. But that also means the first chunk breaks at, but does not include, line two. In other words, the first chunk contains one line, while all the others contain two lines. Maybe you could share your opinion in the comment section, but as for myself, I think this was an unfortunate design choice.
You can mitigate that issue by skipping directly to the first non-empty line:
  sh$ csplit tweets.txt \
  >     --prefix='tweet.' --suffix-format='%03d.txt' \
  >     --elide-empty-files \
  >     --keep-files \
  >     %.% 2 '{*}'
  csplit: ‘2’: line number out of range on repetition 62
  123
  222
  161
[...]
  sh$ head tweet.00[012].txt
  ==> tweet.000.txt <==
  { days:180 }
  I think I use the `sed` command daily. And you?\n\nhttps://www.yesik.it/EP07\n#Shell #Linux #Sed\n#YesIKnowIT

  ==> tweet.001.txt <==
  {}
  Print the first column of a space-separated data file:\nawk '{print $1}' data.txt # Print out just the first column\n\nFor some unknown reason, I find that easier to remember than:\ncut -f1 data.txt\n\n#Linux #AWK #Cut

  ==> tweet.002.txt <==
  {}
  For the #shell #beginners :\n« #GlobPatterns : how to move hundreds of files in not time [1/3] »\nhttps://youtu.be/TvW8DiEmTcQ\n\n#Unix #Linux\n#YesIKnowIT

Reading from stdin

Of course, like most command-line tools, csplit can read its input data from standard input. In that case, you have to specify - as the input filename:
  sh$ tr '[:lower:]' '[:upper:]' < tweets.txt | csplit - \
  >     --prefix='tweet.' --suffix-format='%03d.txt' \
  >     --elide-empty-files \
  >     --keep-files \
  >     %.% 2 '{3}'
  123
  222
  161
  8524

  sh$ head tweet.???.txt
  ==> tweet.000.txt <==
  { DAYS:180 }
  I THINK I USE THE `SED` COMMAND DAILY. AND YOU?\N\NHTTPS://WWW.YESIK.IT/EP07\N#SHELL #LINUX #SED\N#YESIKNOWIT

  ==> tweet.001.txt <==
  {}
  PRINT THE FIRST COLUMN OF A SPACE-SEPARATED DATA FILE:\NAWK '{PRINT $1}' DATA.TXT # PRINT OUT JUST THE FIRST COLUMN\N\NFOR SOME UNKNOWN REASON, I FIND THAT EASIER TO REMEMBER THAN:\NCUT -F1 DATA.TXT\N\N#LINUX #AWK #CUT

  ==> tweet.002.txt <==
  {}
  FOR THE #SHELL #BEGINNERS :\N« #GLOBPATTERNS : HOW TO MOVE HUNDREDS OF FILES IN NOT TIME [1/3] »\NHTTPS://YOUTU.BE/TVW8DIEMTCQ\N\N#UNIX #LINUX\N#YESIKNOWIT

  ==> tweet.003.txt <==
  {}
  WANT TO KNOW THE OLDEST FILE IN YOUR DISK?\N\NFIND / -TYPE F -PRINTF '%TFT%.8TT %P\N' | SORT | LESS\N(SHOULD WORK ON ANY SINGLE UNIX SPECIFICATION COMPLIANT SYSTEM)\N#UNIX #LINUX
  {}
  WHEN USING THE FIND COMMAND, USE `-INAME` INSTEAD OF `-NAME` FOR CASE-INSENSITIVE SEARCH\N#UNIX #LINUX #SHELL #FIND
  {}
  FROM A POSIX SHELL `$OLDPWD` HOLDS THE NAME OF THE PREVIOUS WORKING DIRECTORY:\NCD /TMP\NECHO YOU ARE HERE: $PWD\NECHO YOU WERE HERE: $OLDPWD\NCD $OLDPWD\N\N#UNIX #LINUX #SHELL #CD
  {}
  FROM A POSIX SHELL, "CD" IS A SHORTHAND FOR CD $HOME\N#UNIX #LINUX #SHELL #CD
  {}
  HOW TO MOVE HUNDREDS OF FILES IN NO TIME?\NUSING THE FIND COMMAND!\N\NHTTPS://YOUTU.BE/ZMEFXJYZAQK\N#UNIX #LINUX #MOVE #FILES #FIND\N#YESIKNOWIT
And that's pretty much all I wanted to show you today. I hope that, in the future, you'll use csplit to split files in Linux. If you've enjoyed this article, don't forget to share and like it on your favorite social network!
