How to remove duplicate lines from text files on Linux/UNIX systems
One day, I was preparing a blacklist to block porn websites on a web proxy website called “anaproxy.com”. After checking the server access logs, I extracted all the accessed URLs from the access log, processed the URLs to keep only the domain (actually the domain and its subdomains), and split those domains into alphabetic files for further processing. When I checked the first file, I found it contained a huge number of duplicated lines (domains and subdomains), which forced me to search for an accurate way to remove the duplicated lines while keeping exactly one copy of each. Why did I need an accurate method for removing the duplicated lines?
Simply because I am filtering porn website domains: I need to block all of them and not let a single domain through.
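For context, here is a minimal sketch of the kind of extraction that produces such files (the log format is an assumption here, a Squid-style access log with the requested URL in the 7th field, and these are not the exact commands I ran):
$ awk '{print $7}' access.log | awk -F/ '{print $3}' > all-domains
$ grep -i '^a' all-domains > A-Websites
The first pipeline pulls the URL field out of each log line and keeps only the host part; the second keeps the domains that start with “a” or “A”.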
How can I remove those duplicated lines from my text files?
We can remove the duplicate lines using three options. I’ll list them here, show the difference and the accuracy of each one, and of course recommend the one I use.
I’ll use the file called “A-Websites”, which contains all the collected porn websites starting with the letter “a”. Before starting our first option, let’s count the lines inside this file. I ran the following command:
$ wc -l A-Websites
464 A-Websites
We have 464 porn domains starting with the letter “a”, and some of them are of course duplicated. Let’s start.
Option 1: Using the awk “pattern scanning and processing language” Command
Simply, run the following command:
$ awk '!seen[$0]++' A-Websites > A-Websites-AWK
I saved its output under the name “A-Websites-AWK”. Remember that “A-Websites” contains 464 lines; now let’s check the total number of lines in “A-Websites-AWK”:
$ wc -l A-Websites-AWK
447 A-Websites-AWK
Hooray, I did it: I removed all the duplicated lines (17 duplicated domains removed). But wait a second, when I opened the file to check it, I still found some duplicated lines. Here’s what I found (not real porn domains, I give some example domains):
$ cat A-Websites-AWK
Amazon.com
amazon.com
anaproxy.com
mimastech.com
......
Mimastech.com
Anaproxy.com
Too bad, awk is case sensitive, so I still have some duplicated domains. But the good news is that awk works fine on unsorted lines inside text files.
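A quick note on how that awk one-liner works: awk keeps an associative array called seen; seen[$0]++ returns the previous count for the current line (0 the first time it appears), so !seen[$0]++ is true only for the first occurrence, and awk’s default action prints that line. If you also want awk to ignore letter case, a small variation (my own suggestion, not something from the run above) is to index the array with a lowercased copy of the line:
$ awk '!seen[tolower($0)]++' A-Websites > A-Websites-AWK-nocase
This keeps the first occurrence of each domain regardless of case, and it still works on unsorted files.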
Let’s move to the next option we have.
Option 2: Using the uniq “report or omit repeated lines” Command
The uniq command is our second option; we are still working on the same file “A-Websites”, which contains 464 lines with some duplicates. There are some notes about using the uniq command:
- uniq does not detect repeated lines unless they are adjacent.
- We must sort the lines inside our text file before using uniq.
So it only compares two adjacent lines, and this is a very important point. First, we will sort our “A-Websites” file using the following command:
$ sort A-Websites > A-Websites-Sorted
Now we are ready to use the uniq command on our sorted file “A-Websites-Sorted”, but first let’s check the number of lines/domains inside it using the following command:
$ wc -l A-Websites-Sorted
464 A-Websites-Sorted
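As a side note, if you are curious about which lines are duplicated and how many times before removing anything, you can count occurrences with uniq -c (this inspection step is my own suggestion, not part of the workflow in this article):
$ uniq -c A-Websites-Sorted | sort -rn | head
Each output line shows a count followed by the domain, so anything with a count greater than 1 is duplicated.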
We still have the 464 lines we are working on. There are three cases of using the uniq command:
- Case 1: Using uniq Without Any Options.
The simplest form of using uniq is to pipe the output of the cat command into uniq without any options, as in the following command:
$ cat A-Websites-Sorted | uniq > A-Websites-Uniq-only
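(Note that uniq can also read a file directly, so the cat pipeline above is only a matter of habit; the following would be equivalent:)
$ uniq A-Websites-Sorted > A-Websites-Uniq-only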
Check the line count of the new file “A-Websites-Uniq-only” by running:
$ wc -l A-Websites-Uniq-only
447 A-Websites-Uniq-only
We see 447 lines in this file, which is the same result we got from option 1 (using the awk command). Now I’ll open “A-Websites-Uniq-only” for a visual check using the cat command, as follows (again, not real porn domains, I give some example domains):
$ cat A-Websites-Uniq-only
amazon.com
Amazon.com
.......
anaproxy.com
Anaproxy.com
.......
mimastech.com
Mimastech.com
Too bad, uniq is case sensitive, so I still have some duplicated domains. Also, uniq works only on sorted lines inside text files (that is why we used the sort command at the start of this option). If you use it on an unsorted file, it will give you a very wrong output.
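To see how wrong, here is a tiny made-up example (not taken from the real file) where a repeated domain is missed simply because its two copies are not adjacent:
$ printf 'anaproxy.com\nmimastech.com\nanaproxy.com\n' | uniq
anaproxy.com
mimastech.com
anaproxy.com
Nothing was removed, because uniq only compares each line with the line directly before it.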
- Case 2: Using uniq With the -i Option.
To overcome duplicated lines that differ only in some uppercase and lowercase letters (i.e. to remove the case sensitivity we saw in the case above), we use uniq with the “-i” option, and this will give us exactly what we want.
Again, we are still working on “A-Websites-Sorted”. Now run the following command:
$ cat A-Websites-Sorted |uniq -i > A-Websites-Uniq-i
Check the line count of the new file “A-Websites-Uniq-i” by running:
$ wc -l A-Websites-Uniq-i
444 A-Websites-Uniq-i
We see 444 lines in this file, which is three lines less than the output from using uniq alone. After a visual check of this file with the cat command, I didn’t find any duplicated lines (porn domains in this case), not even ones differing only in letter case: “uniq -i” removed the case-variant duplicates and kept only one line from each group of duplicates. Is this the correct output for what we want? The answer is in the conclusion.
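As a side note, the sort step and the uniq -i step can be combined into one pipeline, so no intermediate sorted file is needed (this variation is my own suggestion, not the exact commands used above); sort -f folds upper and lower case while sorting, which guarantees that lines like amazon.com and Amazon.com end up adjacent before uniq -i compares them:
$ sort -f A-Websites | uniq -i > A-Websites-Uniq-i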
- Case 3: Using uniq With the -u Option.
Before reading this post, you probably saw some online articles giving the command “uniq -u” as the best solution for removing duplicated lines. They may be right or wrong; here we show the difference between using “uniq -u” and “uniq -i”. Let’s use “uniq -u” against our file and check its output by running the following command:
$ cat A-Websites-Sorted |uniq -u > A-Websites-Uniq-u
Check the line count of the new file “A-Websites-Uniq-u” by running:
$ wc -l A-Websites-Uniq-u
430 A-Websites-Uniq-u
Amazing: we see 430 lines in this file, which is 17 lines less than the output from using uniq alone. After a visual check of this file with the cat command, I didn’t find any duplicated lines, and I also didn’t find any of the domains that were duplicated in our first case (using uniq alone), i.e. I found neither anaproxy.com nor Anaproxy.com. uniq -u removes all duplicated lines and does not keep even a single line from each duplicated group (it outputs only the lines that are never repeated). This is not what you and I (in most cases) want; we want to remove the duplicated porn domains but keep one copy of each for our firewall to block.
“uniq -u” will give you the wrong output, or the correct output only in very rare cases depending on your needs, since it removes every line that has a duplicate and does not output even a single copy of it.
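To make the difference clear, here is a tiny made-up example (not taken from the real file) comparing plain uniq with uniq -u on the same sorted input:
$ printf 'anaproxy.com\nanaproxy.com\nmimastech.com\n' | uniq
anaproxy.com
mimastech.com
$ printf 'anaproxy.com\nanaproxy.com\nmimastech.com\n' | uniq -u
mimastech.com
Plain uniq keeps one copy of the repeated domain; uniq -u throws the whole group away and keeps only the line that was never repeated.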
Option 3: Using the sort “sort lines of text files” Command
Last but not least, we can use the sort command with the options “-fu”: a single command that gives us exactly what we need. Remember that in order to use uniq you must run sort first, as uniq works only on sorted lines inside a file. So why should I use two commands to get what I need when I can use only one, sort, with some options? This is the command we recommend for all of you.
Let’s use “sort -fu” against our original (unsorted) file, called “A-Websites”, and check its output by running the following command:
$ sort -fu A-Websites > A-Websites-Sort-fu
Check the line count of the new file “A-Websites-Sort-fu” by running:
$ wc -l A-Websites-Sort-fu
444 A-Websites-Sort-fu
We see 444 lines in this file. After a visual check of this file with the cat command, I didn’t find any duplicated lines (porn domains in this case), not even ones differing only in letter case: “sort -fu” removed the case-variant duplicates and kept only one line from each group of duplicates. It gives us exactly the same output as the “uniq -i” command, but in one step. This is the command we use and recommend for you: a single command uses less processing power on our machine than two commands, especially with large files.
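A related variation, if you also want the final blacklist written entirely in lower case (domain names are case-insensitive anyway), is to normalize the case first and then deduplicate; this is an alternative I am suggesting here, not a replacement for “sort -fu”:
$ tr '[:upper:]' '[:lower:]' < A-Websites | sort -u > A-Websites-lowercase
The line count should match the “sort -fu” result, but every kept domain comes out lowercased.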
Conclusion
“sort -fu” is, in most if not all cases, the command that saves your time and resources: one command that gives you a clean output. In the rare case that you need to remove every group of duplicated lines without keeping even a single copy, then “uniq -u” on a sorted text file is your command.
I hope this article is good enough for you.
See you in other articles.