Loading Text Files into Elasticsearch

Recently, I needed to dump a large number of text files into Elasticsearch. Indexing a single text file in Elasticsearch is no big deal, but there were thousands and thousands of them. This was an experiment, and the loader wouldn't be used again in this form, because I knew the input would change drastically as soon as it was possible to query the data in Elasticsearch. What I needed was something quick and reliable to load some data.

I had a bunch of text files in a directory, and Elasticsearch has a wonderful REST interface that is callable from the command line using curl. Some of the text files contained apostrophes, which would need to be removed before loading, since an apostrophe would end the single-quoted JSON body on the curl command line (losing the apostrophes would not affect the results of the experiment). So the requirements boiled down to sending a bunch of slightly cleaned-up text files into Elasticsearch without a lot of effort. Why not write a small bash script?

The Script

#!/bin/bash
echo "#!/bin/bash" > ./loadit.sh
echo "" >> ./loadit.sh
echo "curl -XDELETE http://localhost:9200/newindex?pretty" >> ./loadit.sh
echo "" >> ./loadit.sh
echo "curl -XPUT http://localhost:9200/newindex?pretty" >> ./loadit.sh
echo "" >> ./loadit.sh
COUNTER=0
FILES="./*.txt"
for f in $FILES
do
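  COUNTER=$((COUNTER + 1))          # arithmetic expansion; the $[...] form is obsolete
  CONTENT="$(sed "s/'//g" "$f")"    # strip apostrophes; sed reads the file itself, no cat needed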
  echo "curl -XPUT http://localhost:9200/newindex/doc/$COUNTER -d \
       '{\"content\" : \"$CONTENT\"}'" >> ./loadit.sh
  echo "" >> ./loadit.sh
done
chmod 777 ./loadit.sh
/bin/bash ./loadit.sh

That's 19 lines of code, but a lot is going on! This script generates a second script and then executes it. Code generation, at any level, is a neat trick.

Many lines take advantage of the simple echo command (example: echo "#!/bin/bash" > ./loadit.sh). This is well-worn territory for anyone familiar with writing a bash script, with the only twist being the use of > and >>. The first echo uses > so that it overwrites any output left over from a previous run; every later echo uses >> to append to the file.
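The difference between > and >> can be seen in isolation with a throwaway file. This is only an illustrative sketch: the temporary file created with mktemp stands in for loadit.sh and is not part of the original script.

```shell
#!/bin/bash
# Illustrative only: a temporary file stands in for ./loadit.sh.
out="$(mktemp)"

echo "stale line from a previous run" > "$out"

# '>' truncates the file first, so the stale line disappears.
echo '#!/bin/bash' > "$out"
# '>>' appends, so each later echo adds one more line.
echo "" >> "$out"
echo "echo hello" >> "$out"

line_count="$(wc -l < "$out" | tr -d ' ')"
first_line="$(head -n 1 "$out")"
echo "lines: $line_count, first: $first_line"

rm -f "$out"
```

Single quotes around the shebang line are used here so the example also behaves when pasted into an interactive shell, where ! inside double quotes can trigger history expansion.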

Lines 4 and 6 write out the commands to delete the index in Elasticsearch, if it exists, and to create it again. These commands, and all subsequent commands, point at localhost, so they assume that the generated script is executed on the Elasticsearch host. This may seem like a trivial distinction, but when using the attachments plugin of Elasticsearch the documents would have to be on the host, so this approach may come in handy there.
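If the generated script has to run from a different machine, one option is to pull the host into a variable. The following is a minimal sketch, not the original script: ES_HOST, INDEX and write_index_reset are names introduced here purely for illustration, and the function only prints the commands rather than contacting Elasticsearch.

```shell
#!/bin/bash
# Hypothetical names: ES_HOST, INDEX and write_index_reset are illustrative.
ES_HOST="${ES_HOST:-http://localhost:9200}"
INDEX="newindex"

# Print the two index-management commands; nothing is sent to Elasticsearch here.
write_index_reset() {
  # The URL is single-quoted in the output so the generated shell
  # never treats the '?' as a glob character.
  echo "curl -XDELETE '$ES_HOST/$INDEX?pretty'"
  echo ""
  echo "curl -XPUT '$ES_HOST/$INDEX?pretty'"
  echo ""
}

write_index_reset
```

Redirecting the function's output (for example write_index_reset > ./loadit.sh) would start the generated file in the same way the first few echo lines do.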

The loop employs some standard bash scripting, iterating over every file in the directory whose name ends in .txt, while the COUNTER variable introduced on line 8 increases by one on each iteration. This is straightforward bash scripting and only important here because COUNTER is used as the document ID in Elasticsearch on line 14.
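The loop and counter can be tried on their own against a scratch directory. This sketch makes a few assumptions that are not in the original: the directory and file names are invented, the glob is written directly in the for statement, and every expansion is quoted so that file names containing spaces survive.

```shell
#!/bin/bash
# Illustrative only: a scratch directory with three small files.
dir="$(mktemp -d)"
printf 'one\n'   > "$dir/a.txt"
printf 'two\n'   > "$dir/b c.txt"    # a name with a space, to show why quoting matters
printf 'three\n' > "$dir/notes.log"  # not matched by *.txt

COUNTER=0
for f in "$dir"/*.txt
do
  # If no .txt file exists the pattern stays literal; skip it.
  [ -e "$f" ] || continue
  COUNTER=$((COUNTER + 1))
  echo "document $COUNTER <- $(basename "$f")"
done

rm -rf "$dir"
```

Only the two .txt files are counted, so COUNTER ends at 2 and each file gets a distinct number to use as its document ID.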

Line 13 uses the stream editor sed, arguably one of the most powerful command line features ever! The sed command is awesome, providing robust regular expression matching and replacement. The usage here is about as simple as it gets, matching every apostrophe and replacing it with nothing, but sed excels at exactly this kind of work.
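The same substitution can be run on a single string to see what it does. The sample sentence below is made up for the example.

```shell
#!/bin/bash
# Illustrative input; the sentence is made up for this example.
sample="It's the author's file"

# s/'//g : replace every apostrophe with nothing (g = all matches on each line).
cleaned="$(printf '%s\n' "$sample" | sed "s/'//g")"

echo "$cleaned"   # prints: Its the authors file
```

The sed expression is wrapped in double quotes because the pattern itself is a single quote.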

Then on line 14 it all comes together with the generation of the curl command that writes the document into the newly created index. It would also be possible to have curl read the request body from a file (replacing the text after -d with @theFileName.txt, with no space after the @), but since the apostrophes had to be removed anyway, the content was put in-line with the command.
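For comparison, here is a rough sketch of the file-based alternative. It is not what the original script does: make_body is a hypothetical helper introduced for this example, the sample text is invented, and the escaping assumes the input contains no tabs or other control characters apart from newlines. The file passed to curl has to contain the JSON body rather than the raw text, so the helper wraps and escapes the text first.

```shell
#!/bin/bash
# Hypothetical helper: make_body is a name introduced for this sketch.
# It prints a one-line JSON body for one text file, assuming the text has
# no tabs or other control characters apart from newlines.
make_body() {
  local text
  # Escape backslashes first, then double quotes; paste joins lines with spaces.
  text="$(sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' "$1" | paste -s -d ' ' -)"
  printf '{"content" : "%s"}\n' "$text"
}

src="$(mktemp)"
body="$(mktemp)"
printf "She said \"hi\"\nIt's fine\n" > "$src"

make_body "$src" > "$body"
result="$(cat "$body")"
echo "$result"

# The body file could then be sent with curl's @file form, for example:
#   curl -XPUT http://localhost:9200/newindex/doc/1 -d @"$body"
# (Elasticsearch 6.0 and later also expect -H 'Content-Type: application/json'.)

rm -f "$src" "$body"
```

Because the body never passes through shell quoting in this variant, the apostrophe can stay in the text; the trade-off is one extra file per document.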

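Once the loop is done, the permissions on the file are set to 777 (restrict this if necessary; because the script is started explicitly with /bin/bash, it does not strictly need the execute bit at all) and the newly generated script is executed, thus loading thousands of documents into Elasticsearch.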