Indexing bulk documents to Elasticsearch using jq

Hey, all! I recently started using Elasticsearch and I have to tell you, I love it already! So here’s a blog post on importing and indexing a JSON file into Elasticsearch.

If you have a lot of documents to index, you can use Elasticsearch’s Bulk API to send them in batches. However, the request body has to follow the bulk format; otherwise you may come across an error such as "Malformed content, found extra data after parsing: START_OBJECT". The Bulk API expects the following newline-delimited JSON (NDJSON) structure, in which every action line is followed by the document it applies to:

{ "index" : { "_index" : "test", "_id" : "1" } } 
{ "field1" : "value1" }
{ "index" : { "_index" : "test", "_id" : "2" } }
{ "field2" : "value2" }

Check out the Bulk API documentation for further reference.

Now, you can use the jq tool to convert your JSON file into the bulk format on the command line. jq is an extremely handy command-line JSON processor. To use it, first make sure you have jq installed.

  • On Debian-based systems, you can install it via sudo apt-get install jq
  • On macOS, you can install it via brew install jq

Then, execute the following command to get a new JSON file in a bulk format.

cat info.json | jq -c '{"index": {"_index": "students", "_type": "doc"}}, .' > students.json

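Here, we pipe the contents of the info.json file to jq. The -c option makes jq write compact JSON, one object per line, rather than pretty-printed JSON, and the filter emits the action line followed by the original document (.) for every input object. So if your original info.json file looks something like this: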

{"enrol_number": 1, "firstname": "Drake", "lastname": "Wilson", "age": 16, "gender": "M"}
{"enrol_number": 2, "firstname": "Scarlet", "lastname": "Rose", "age": 14, "gender": "F"}

The output gets written into a new file students.json:

{"index":{"_index":"students","_type":"doc"}}
{"enrol_number":1,"firstname":"Drake","lastname":"Wilson","age":16,"gender":"M"}
{"index":{"_index":"students","_type":"doc"}}
{"enrol_number":2,"firstname":"Scarlet","lastname":"Rose","age":14,"gender":"F"}
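If jq is not available, the same interleaving can be sketched in a few lines of Python. This is only a minimal sketch: the function name to_bulk_lines is made up for illustration, and it assumes the input holds one JSON document per line, like info.json above.

```python
import json


def to_bulk_lines(ndjson_text, action):
    """Put the given action line in front of every document.

    `to_bulk_lines` is a hypothetical helper for illustration; it assumes
    one JSON document per input line.
    """
    # Compact separators mirror the output of `jq -c`.
    action_line = json.dumps(action, separators=(",", ":"))
    lines = []
    for raw in ndjson_text.splitlines():
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines
        document = json.loads(raw)  # raises ValueError on malformed JSON
        lines.append(action_line)
        lines.append(json.dumps(document, separators=(",", ":")))
    # Every line, including the last one, ends with a newline.
    return "".join(line + "\n" for line in lines)


# Usage: same action line as in the jq command.
# with open("info.json") as src, open("students.json", "w") as dst:
#     dst.write(to_bulk_lines(src.read(),
#                             {"index": {"_index": "students", "_type": "doc"}}))
```

Parsing each line with json.loads before writing it back means a malformed document fails early instead of producing a bulk file that Elasticsearch rejects later.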

Note: Since we are not specifying any id, a document id is automatically generated. The next step is to index the data of students.json file into the students index using a _bulk request:

curl -XPOST "localhost:9200/students/_bulk?pretty&refresh" -H "Content-Type: application/json" --data-binary "@students.json"
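For readers who prefer to send the request from a script, here is a rough Python equivalent of the curl call using only the standard library. Treat it as a sketch: build_bulk_request is a name invented for this example, and the host, index and query string are simply copied from the command above. One detail worth knowing is that the bulk body has to end with a newline, so the helper adds one if it is missing.

```python
import urllib.request


def build_bulk_request(payload, host="localhost:9200", index="students"):
    """Build (but do not send) a _bulk request like the curl call above.

    `build_bulk_request` is a hypothetical helper; host and index default
    to the values used in this post.
    """
    if not payload.endswith(b"\n"):
        payload += b"\n"  # the bulk body must be terminated by a newline
    url = "http://{}/{}/_bulk?pretty&refresh".format(host, index)
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (needs a running Elasticsearch node):
# with open("students.json", "rb") as bulk_file:
#     request = build_bulk_request(bulk_file.read())
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```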

This is how we format our JSON file the way Elasticsearch’s Bulk API expects it and post it to Elasticsearch. Using the cat indices API, we can then get information about the index, such as shard count, document count, deleted document count and primary store size.

curl "localhost:9200/_cat/indices?v"

Thanks for reading! 🙂

Outreachy internship : End of an amazing journey

My Outreachy internship with the Open Information Security Foundation has ended, and this is one of my last blog posts about it. I can’t believe I’ve come so far. I still remember the day when getting into Outreachy was just a dream to me. And when that dream came true and I got selected, I was so afraid that I kept thinking, “What if I’m not good enough? What if I can’t complete a task? What if I’m not showing any progress?” And now here I am, successfully completing the internship, feeling more confident and motivated.

In this blog post I’ll point out the challenges I faced and what I’ve gained throughout my journey.

The internship started with simpler tasks so that I could get a better understanding of the codebase and clear up things that were still unclear to me. This time was more of a learning phase for me.
During the second phase of the internship, my mentors gave me tasks that were a bit more complex than the previous ones. During this time, I started feeling lost about my progress, struggling to figure out the correct approach to a problem. One of the challenges I had during the internship was keeping the communication with my mentors clear and explaining to them where I was stuck, which I think has now improved a little.
The third phase of the internship was about adding important features to the project. I remember a time when I decided to give up on a task after being stuck on it and trying every possible way to find the correct approach. I decided to take a break and talk to my friends. The next day I was able to find a way to approach the solution. Sometimes sitting on a problem for hours does not give you the solution. You need to relax, refresh your mind and handle things patiently.

I’ve described the tasks I completed in the first two phases in my previous blog posts. For the third phase of the internship, I contributed to the following parts of the project:

  • Log a warning if index is old: Added a check that logs a warning if the index file is older than 2 weeks, telling the user to update it by running suricata-update update-sources.
  • Added “check-versions” subcommand: Added a “suricata-update check-versions” subcommand that checks the version of Suricata and logs whether it is up to date, outdated or EOL.
  • Separated log messages to stderr and stdout: Until now, all the messages in suricata-update were logged to stderr. The output is now split so that regular messages (INFO, DEBUG) go to stdout, whereas ERROR, WARNING and CRITICAL messages go to stderr.
  • Added a check to apply color only if the output stream is a TTY.
  • Added “no-checksum” option: Added a --no-checksum option to the add-source command of suricata-update. Downloading the checksum URL is skipped if the source is configured with no-checksum set to true.
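To give an idea of how the stdout/stderr split and the TTY check can work, here is a minimal sketch built on Python’s standard logging module. It is not the code that went into suricata-update: the names MaxLevelFilter, build_logger and use_color, as well as the log format, are assumptions made up for this example.

```python
import logging
import sys


class MaxLevelFilter(logging.Filter):
    """Let through only records at or below a given level (hypothetical helper)."""

    def __init__(self, max_level):
        super().__init__()
        self.max_level = max_level

    def filter(self, record):
        return record.levelno <= self.max_level


def use_color(stream):
    """Apply color only when the stream is attached to a terminal."""
    return hasattr(stream, "isatty") and stream.isatty()


def build_logger(name="demo", out=None, err=None):
    """Send DEBUG/INFO to `out` and WARNING and above to `err`."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err

    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep the root logger from printing again

    out_handler = logging.StreamHandler(out)
    out_handler.setLevel(logging.DEBUG)
    out_handler.addFilter(MaxLevelFilter(logging.INFO))

    err_handler = logging.StreamHandler(err)
    err_handler.setLevel(logging.WARNING)

    formatter = logging.Formatter("<%(levelname)s> -- %(message)s")
    for handler in (out_handler, err_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

The filter on the stdout handler is what keeps warnings and errors from showing up twice: without it, the DEBUG-level handler would print every record, including the ones meant for stderr.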

You can find all my contributions to the project on its GitHub page.

The internship has helped me build many skills. I never thought I could ever write a blog post. I just loved the way we got a theme every alternate week and took time to brainstorm and think outside the box for every blog post! Apart from this, it has helped me communicate effectively through chat and emails.
In technical terms, I’ve learned about so many things: logging, argparse and operating system interfaces in Python, contributing to the documentation, writing clean code, etc.
Additionally, I’ve learned to ask questions and be less afraid of making mistakes. I have gained so much motivation to reach everything I want with hard work and consistency. I have finally let go of all my apprehensions about learning newer technologies, frameworks and skills! I can’t emphasize enough how big an achievement this is for me. And I could do this over a couple of months, as opposed to the year and a half that I’d been trying for.

I had an awesome time during these 3 months. It’s really hard to bid farewell to this internship.
Thank you to all my mentors, Shivani, Jason and Victor, for giving me this opportunity and for all the guidance and support. You all are amazing people. 🙂 I hope to learn more from you all in the future and to meet you in person.

Thank you to the Outreachy organizers for bringing this opportunity, for hosting the weekly Zulip chat sessions and for the awesome blogging assignments, and lastly thank you to everyone else who worked hard to make all this happen!

Thanks for reading! 🙂

Outreachy Internship: Improve suricata-update

Hi there!
It’s been 5 weeks since I started my Outreachy internship, working with OISF on the suricata-update project. In this blog post, I’ll explain my project and what I’ve been working on so far.

About my internship project

The name of the project I’m working on is “Improve suricata-update”. Suricata-update is a subproject of Suricata. Since suricata is too big of a project, I’ll try to explain it from what I’ve read and understood.

What is Suricata?

Suricata is a free and open source network threat detection engine that provides capabilities including real-time intrusion detection (IDS), inline intrusion prevention (IPS), network security monitoring and offline pcap processing. It inspects the network traffic using a powerful and extensive rules and signature language and performs very well for detection of complex threats and attacks.

What is Suricata-update?

Suricata-update is a tool for downloading and managing the rulesets for Suricata. It makes it easier for users to find available rulesets, and it allows rule writers to make their rules more discoverable. These rulesets are published by security sources such as Proofpoint, Secureworks, etc.

Features of suricata-update include:

  • Default to Emerging Threats Open ruleset if no configuration provided.
  • Automatic discovery of Suricata version for use in rule set URLs.
  • Flowbit resolution
  • Enable, disable, drop and modify filters that should be familiar to users of Pulled Pork and Oinkmaster.
  • Easy enabling of additional rule sets from the index.

Suricata Rules

A rule/signature is a notation made up of certain keywords and options in a language that Suricata understands so that it is possible to detect and/or prevent a threat to the system that Suricata is monitoring. It consists of the following:

  • The action, that determines what happens when the signature matches
  • The header, defining the protocol, IP addresses, ports and direction of the rule.
  • The rule options, defining the specifics of the rule.

An example of a rule taken from an open database of Emerging Threats is as follows:

drop tcp $HOME_NET any -> $EXTERNAL_NET any (msg:"ET TROJAN Likely Bot Nick in IRC (USA +..)"; flow:established,to_server; flowbits:isset,is_proto_irc; content:"NICK "; pcre:"/NICK .*USA.*[0-9]{3,}/i"; reference:url,; classtype:trojan-activity; sid:2008124; rev:2;)

In this example, drop is the action, tcp $HOME_NET any -> $EXTERNAL_NET any is the header, and everything inside the parentheses makes up the rule options. To know more about rules, please see the Suricata rules documentation.
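The three parts can be pulled apart with a few lines of Python. This is a deliberately naive sketch for illustration, not the parser that Suricata or suricata-update uses: the split_rule function is hypothetical, and it assumes that no option value contains a semicolon (real rules can escape one inside a quoted string, which this sketch would mis-split).

```python
def split_rule(rule):
    """Split a rule into action, header and options (naive, illustrative only)."""
    # Everything before the first "(" is the action plus the header.
    head, _, rest = rule.strip().partition("(")
    action, _, header = head.strip().partition(" ")

    # Drop the closing parenthesis of the options block.
    body = rest.rstrip()
    if body.endswith(")"):
        body = body[:-1]

    # Assumption: options are separated by ";" and contain no ";" themselves.
    options = [opt.strip() for opt in body.split(";") if opt.strip()]
    return {"action": action, "header": header.strip(), "options": options}


# A made-up rule with the same shape as the example above.
demo = ('drop tcp $HOME_NET any -> $EXTERNAL_NET any '
        '(msg:"demo rule"; flow:established,to_server; sid:1000001; rev:1;)')
parts = split_rule(demo)
# parts["action"] is "drop"
# parts["header"] is "tcp $HOME_NET any -> $EXTERNAL_NET any"
```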

Suricata-update options

Suricata-update provides various command-line options and arguments for passing parameters to the program. Running suricata-update --help lists them.

The detailed functionality of these options is described in the suricata-update documentation.

Improve suricata-update project

For my internship, I am working on improving suricata-update. The project is written in Python. I have completed the following tasks for the project:

  • Fixed a bug related to the --no-merge option
  • Cleanup unused and scattered imports
  • Improved permission warnings for non-root users
  • Updated docs to setup directories with correct permissions
  • Separated code for rule matching
  • Logged warning on duplicate Sid

I have described some of these tasks in my previous blog posts.
I’m still working on the following tasks:

  • Separate code for command line parsers: The parsers module is being broken into smaller functions, one per parser. Repeated add_argument calls are reduced by storing the arguments in a tuple and adding them to the parsers in a loop, which makes the code cleaner and more compact.
  • Adding a --offline command line option: Currently, suricata-update always goes online to download the rules, and there is no “offline” option that prevents downloading and uses the cached files instead. Therefore, I am working on adding a --offline command line option that uses the latest locally cached version of the rules without trying to download them from the sources.
  • Rename variables and functions that clash with Python names: Working on renaming variables and functions in suricata-update, such as “filter”, that shadow Python built-ins or names used by standard modules.
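The idea behind the parser cleanup can be sketched as follows. This is only an illustration of keeping the argument definitions in a tuple and registering them in a loop; the tuple name COMMON_ARGS, the program name and the exact set of options are assumptions for the example, not the real suricata-update parser.

```python
import argparse

# Each entry is (flags, keyword arguments for add_argument).
# The options listed here are illustrative, not the project's full list.
COMMON_ARGS = (
    (("-v", "--verbose"),
     {"action": "store_true", "help": "Be more verbose"}),
    (("-q", "--quiet"),
     {"action": "store_true", "help": "Only log warnings and errors"}),
    (("--offline",),
     {"action": "store_true", "help": "Use cached rules instead of downloading"}),
)


def build_parser():
    """Create a parser by looping over the argument table."""
    parser = argparse.ArgumentParser(prog="demo-update")
    for flags, options in COMMON_ARGS:
        parser.add_argument(*flags, **options)
    return parser


args = build_parser().parse_args(["--offline", "-v"])
# args.offline and args.verbose are True, args.quiet is False
```

Because the definitions live in data rather than in repeated calls, the same table can be applied to several subcommand parsers without copying the add_argument lines.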

For the next task, I’ll be working on checking the versions of Suricata and suricata-update by adding a “--check-versions” command.

By contributing to the project, I am learning the internals of Suricata. There have been many hurdles along the way, but my supportive and helpful mentors have always been there to guide me.

I hope I was able to give an overview of the project. Thanks for reading. Stay tuned for more updates.

Everybody struggles: One step closer to learning

This blog post is about the struggles I faced in the last three weeks, how I reached out for help and what I learned from those struggles.

A few days before my Outreachy internship with OISF started, I began bonding with the community. Seeing the community members so passionate about improving the community and working hard for it gave me a lot of motivation. But at the same time I was scared: would I be able to give my best?

That’s when the real struggle began...

I started reading the documentation and trying to understand the codebase of my project, suricata-update. But soon I realized that there is a difference between understanding the codebase and the documentation by merely reading them and actually playing around with the code.
Playing here refers to fixing a bug, adding a little feature or even writing your own little scripts when you don’t understand certain behaviors in the code. It may not be the best way for everyone, but it helped me the most. 🙂

and the struggle continues…

In my first days as an intern with OISF, my mentor asked me to continue with one of the tasks I had claimed during the application period. Now, this was going to be the hard part for me. I had already been working on the task for a few weeks but had not been able to find the solution.

I would like to give an idea about the issue I was stuck on.

When suricata-update is run as a non-root user, it fails with an ugly traceback. The task was to check the permissions of the directory and log the errors instead. Here is the output from the Redmine ticket showing what the traceback looks like.

12/3/2019 -- 11:50:44 - <Warning> -- No suricata application binary found on path.
12/3/2019 -- 11:50:44 - <Info> -- Using Suricata configuration /etc/suricata/suricata.yaml
12/3/2019 -- 11:50:44 - <Info> -- Using /etc/suricata/rules for Suricata provided rules.
12/3/2019 -- 11:50:44 - <Info> -- Using default Suricata version of 4.0.0
12/3/2019 -- 11:50:44 - <Warning> -- No index exists, will use bundled index.
12/3/2019 -- 11:50:44 - <Warning> -- Please run suricata-update update-sources.
12/3/2019 -- 11:50:44 - <Info> -- Fetching
Traceback (most recent call last):    
  File "./bin/suricata-update", line 33, in <module>
  File "/home/victor/sync/devel/suricata-update/suricata/update/", line 1458, in main
  File "/home/victor/sync/devel/suricata-update/suricata/update/", line 1312, in _main
    files = load_sources(suricata_version)
  File "/home/victor/sync/devel/suricata-update/suricata/update/", line 997, in load_sources
    Fetch().run(url, files)
  File "/home/victor/sync/devel/suricata-update/suricata/update/", line 395, in run
    fetched = self.fetch(url)
  File "/home/victor/sync/devel/suricata-update/suricata/update/", line 385, in fetch
    raise err
IOError: [Errno 13] Permission denied: '/var/lib/suricata/update/cache/5c25dfc84c3d879cd2f90fda6337b9dd-traffic-id.rules'

I was stuck on reproducing this traceback, as I was getting a different error about directory permissions. Here is what the traceback I was getting looked like.

28/4/2019 -- 00:38:12 - <Info> -- Using data-directory /usr/local/var/lib/suricata.
28/4/2019 -- 00:38:12 - <Info> -- Using Suricata configuration /usr/local/etc/suricata/suricata.yaml
28/4/2019 -- 00:38:12 - <Info> -- Using /usr/local/etc/suricata/rules for Suricata provided rules.
28/4/2019 -- 00:38:12 - <Info> -- Found Suricata version 5.0.0-dev at /usr/local/bin/suricata.
28/4/2019 -- 00:38:12 - <Info> -- Loading /usr/local/etc/suricata/suricata.yaml
28/4/2019 -- 00:38:12 - <Error> -- [ERRCODE: SC_ERR_FATAL(171)] - failed to open file: /usr/local/etc/suricata/suricata.yaml: Permission denied
Traceback (most recent call last):
  File "./bin/suricata-update", line 33, in <module>
  File "/home/vagisha/suricata/suricata-update/suricata/update/", line 1458, in main
  File "/home/vagisha/suricata/suricata-update/suricata/update/", line 1290, in _main
    config.get("suricata-conf"), suricata_path=suricata_path)
  File "/home/vagisha/suricata/suricata-update/suricata/update/", line 96, in load
  File "/usr/lib/python2.7/", line 223, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '['/usr/local/bin/suricata', '-c', u'/usr/local/etc/suricata/suricata.yaml', '--dump-config']' returned non-zero exit status 1

I started googling how to deal with file permissions and came up with the os.access method. My mentor suggested using another approach, as os.access is not the right way to deal with permissions. So I started putting open checks with a pass statement in random places, trying to fix the errors about directory permissions, instead of looking up the exact places where the exception was coming from and making changes there. Though this approach wasn’t confusing, it didn’t help me much in solving the issue.
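One common alternative to os.access is to simply attempt the operation and handle the failure where it happens. A check with os.access can go stale before the file is actually opened, and it tests the real rather than the effective user ID. The sketch below shows the pattern under stated assumptions: write_cache_file and the logger name are invented for this example and are not the functions from suricata-update.

```python
import logging

logger = logging.getLogger("permission-demo")


def write_cache_file(path, data):
    """Try to write `data` to `path`; log an error instead of raising.

    `write_cache_file` is a hypothetical helper for illustration.
    """
    try:
        with open(path, "wb") as fileobj:
            fileobj.write(data)
    except (IOError, OSError) as err:
        # Covers "Permission denied" as well as a missing directory.
        logger.error("Failed to write %s: %s", path, err)
        return False
    return True
```

The caller gets a plain True or False and the user sees a one-line error message, rather than a traceback like the ones above.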

I shared these details with my mentor. She helped me figure out where I was going wrong by giving me a few hints. After submitting 6 PRs on the issue, I was finally able to find the right approach, reproduce the error and finish the task. Here is a GitHub link to it. Since the error was related to non-root users, I also managed to set up the directories with the correct permissions on my system by creating a group, adding users to that group and changing the permissions as mentioned here. Luckily, my mentor is always there to help me, however many times I reach out to her. She is patient enough to review my PRs over several iterations, even when I am not following the correct approach.

A little advice from my side to you when you are stuck for more than a couple of hours:

  • Reach out for help from your mentor/s or on public community forums and talk about it. Sometimes even the act of asking for help can help you find a solution!
  • Make notes so you don’t forget to ask something that you’re unsure about.

Asking for help can be one of the toughest things to do, whether the problem is difficult or not. But at one point or another everyone has been there (I recently learned from an encouraging mail from my mentor that even she gets stuck sometimes and has to ask other community members for help), because we all want our project to be a success, to learn from it and to see ourselves grow. 🙂

So it’s completely OK to get stuck. Everybody struggles!

Don’t hesitate to ask questions!
Stay tuned until my next blog!

Have a great day! 🙂

A walk-through to Outreachy

This blog post is for all those people from underrepresented groups in tech who want to get involved in learning about and contributing to free and open source software. Outreachy is a great initiative to get more women and members of other underrepresented groups involved in Free & Open Source Software.

So if you are wondering about getting into Outreachy, or have been bothered by questions regarding Outreachy, this post will help you get on the right track!

What is Outreachy?

Outreachy provides three-month paid internships for people from groups underrepresented in tech. Interns work remotely and closely with assigned mentors from Free and Open Source Software communities (like the Open Information Security Foundation, Mozilla, Wikimedia and the Linux kernel, to name a few). The internship rounds run twice a year, from May to August and from December to March. Interns are paid a stipend of $5,500 and have a $500 travel stipend available to them to attend conferences or events. Interns work on projects ranging from programming, user experience, documentation, illustration and graphic design to data science. Interns go through 3 evaluations within the internship period.

To know more about Outreachy internship program, please check the official website here.

Getting into Outreachy

There are some steps you need to get right to maximize your chances of selection:

Start Early

Applying to Outreachy is not a one-step process but a relatively involved process. A couple of months before each round begins, the list of participating organizations/projects is announced. Try to start as early as possible after the names of organizations are announced. Ensure that you have some prior experience with the skills mentioned in the required skills section of the project. To increase the chance of getting selected it’s always a good idea to apply for 2 organizations (in my case I focused only on 1 organization).

Contribute and communicate

After you have chosen your projects, it’s time to join the organization’s mailing list, Slack or IRC channel and introduce yourself to the community. Remember that communication is one of the key factors in the selection decision, and it gives mentors a positive impression of you as an active applicant who’s willing to put effort into a project. Ask mentors for any kind of help, whether it’s regarding installation, documentation or finding good first bugs to work on.

A contribution to an open source project can be anything from fixing or reporting a bug and writing documentation to optimizing the code or adding a new feature to the project. Always ask before starting to work on an issue to make sure nobody else is working on it. This avoids redundant work and makes it easy for the mentors to provide feedback. Once you have submitted a PR, wait for the mentor’s review. You may have to revise and resubmit the same PR several times if the mentors suggest changes. Don’t get disheartened, as this will only help you learn more. Once your previous PR is merged or accepted, pick a slightly harder issue and keep contributing!

Final application

The final application is basically your whole application. Start a draft in a plain text file as soon as you submit your first PR and keep improving it iteratively. You need to record all your contributions in this section, including your open, accepted and merged PRs. You also need to answer sets of questions with lots of information about your experience with FOSS, your contributions to the project, your time commitment, etc. Some organizations also ask community-specific questions to fulfill their own requirements for the project. Your proposal is going to be an important key to your selection, so make sure that you put in extra effort to make it detailed and informative. Try not to keep it too generic, and answer the questions honestly.

Write a blog post

It’s a good idea to blog about your experience while applying for the internship. It may also include the details about what you are working on, things that you find surprising about working in open source, techniques for working efficiently or even talking about what you find confusing. You can also share your achievements throughout the application process.

Don’t stop!

After you’re done with submitting the final application, don’t stop contributing. It’s not necessary to keep contributing, but it will only improve your chances of selection. Keep interacting with the organization members and make the best of the time until the results are announced.

My Outreachy result :

Accepted 🙂 🙂

Tip for future applicants:

It’s not the number of PRs or lines of code you write that matters but the effort you put in the overall application process from beginning till the end.

Thanks for reading! 🙂
Find out my experience with the application process here.

I will be sharing more about my journey with Outreachy, so stay tuned!
Feel free to connect on Twitter.