Saturday, April 9, 2016

Python web scraping OSCTI

Python scripts can scrape open source cyber threat intelligence (OSCTI) sources such as malwaredomain.com and extract simple, low-level IOCs (those near the bottom of the IOC Pyramid of Pain). The raw output can then be consumed by inline solutions (NGFW, NGIPS, etc.) as well as by monitoring infrastructure such as a SIEM. It can be formatted as CEF and streamed over syslog, or written out in simple formats such as XML or CSV.
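As a rough illustration of the CEF option, here is a minimal sketch that turns one scraped IP address into a CEF line. The vendor, product, version, signature ID and event name are placeholders I made up for the example, and the choice of the `src` and `cs1` extension keys is an assumption; check what your SIEM expects before using anything like this.

```python
def cef_escape_header(value: str) -> str:
    # CEF header fields must escape backslashes and pipes.
    return value.replace("\\", "\\\\").replace("|", "\\|")


def cef_escape_extension(value: str) -> str:
    # CEF extension values must escape backslashes and equals signs.
    return value.replace("\\", "\\\\").replace("=", "\\=")


def ip_to_cef(ip: str, source: str, severity: int = 5) -> str:
    """Build one CEF line for a malicious IP indicator.

    Header layout: CEF:Version|Vendor|Product|Version|SignatureID|Name|Severity|Extension
    All header values below are hypothetical placeholders.
    """
    header_fields = [
        "OSCTI-Scraper",           # device vendor (placeholder)
        "ioc-feed",                # device product (placeholder)
        "1.0",                     # device version (placeholder)
        "100",                     # signature ID (placeholder)
        "Malicious IP indicator",  # event name (placeholder)
        str(severity),             # CEF severity, 0-10
    ]
    header = "|".join(["CEF:0"] + [cef_escape_header(f) for f in header_fields])
    extension = "src={} cs1Label=feed cs1={}".format(
        cef_escape_extension(ip), cef_escape_extension(source)
    )
    return header + "|" + extension


print(ip_to_cef("203.0.113.7", "example-feed"))
```

The resulting string could then be handed to a syslog sender of your choice.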

With the rising deployment of Endpoint Detection and Response (EDR), also known as next-generation endpoint security, this raw output could also be formatted as YARA rules or another format that these EDR solutions can consume.

A typical methodology is to first visit the source that needs to be scraped and use the web browser to inspect the page structure by examining the HTML tag tree.


Once the structure is analyzed, we can begin coding. We will rely on two common Python packages to do the heavy lifting, Requests and Beautiful Soup.

Check the documentation out at: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#



We load the page with an HTTP GET request (requests.get). The response is then parsed with BeautifulSoup(r.text). Once the HTML is parsed, we simply look for the tag that encloses the IOCs of interest. In our case we are looking specifically for malicious IP addresses.
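The fetch-and-parse step described above can be sketched roughly as follows. This is not the original script, just a minimal version under a few assumptions: the function names are mine, the sample HTML is invented, and I pass "html.parser" explicitly so the example does not depend on an extra parser being installed.

```python
from typing import List

import requests
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page and return its HTML text."""
    r = requests.get(url, timeout=30)
    r.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    return r.text


def td_strings(html: str) -> List[str]:
    """Return the stripped text of every <td> cell in the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [td.get_text(strip=True) for td in soup.find_all("td")]


# Invented sample standing in for a real page, so no network access is needed here.
SAMPLE_HTML = """
<table>
  <tr><td>2016/04/09</td><td> 203.0.113.7 </td></tr>
  <tr><td>2016/04/09</td><td>198.51.100.23</td></tr>
</table>
"""

print(td_strings(SAMPLE_HTML))
```

Against a live source you would call td_strings(fetch_html(url)) with the real page URL.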

Once the strings of the parsed <td> tags are acquired, we iterate over them and extract the IP addresses from the array.
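One way to pick the IP addresses out of the list of cell strings is to let the standard library's ipaddress module do the validation, as in the sketch below. The handling of a trailing port or path (for example "198.51.100.23:8080") is an assumption on my part about how such a feed might present its entries, and the sample cells are invented.

```python
import ipaddress
from typing import Iterable, List


def extract_ips(cells: Iterable[str]) -> List[str]:
    """Return the unique IPv4 addresses found in the cells, in first-seen order."""
    found: List[str] = []
    seen = set()
    for cell in cells:
        # Assumption: an entry may carry a path or port after the address,
        # so keep only the part before the first "/" and ":".
        candidate = cell.strip().split("/")[0].split(":")[0]
        try:
            ip = str(ipaddress.IPv4Address(candidate))
        except ValueError:
            continue  # not an IPv4 address (date, hostname, placeholder, ...)
        if ip not in seen:
            seen.add(ip)
            found.append(ip)
    return found


cells = ["2016/04/09", "203.0.113.7", "example.com", "198.51.100.23:8080", "203.0.113.7", "-"]
print(extract_ips(cells))
```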



We do the same for the remaining 32 pages: by observing the URL structure, we can iterate through all of the pages, replacing the page parameter in the URL with each number in the range. Here I have hard-coded the number 33 because I saw 33 pages of data in the malwaredomain.com source. However, the range could be made dynamic by scraping the last page number from the first page, using a method similar to the one above, and using that value for the page iteration loop.
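The page loop, including the dynamic variant, might look something like this sketch. The URL template is a made-up placeholder (the real parameter name and whether numbering starts at 0 or 1 have to be read off the site), and last_page_number assumes you have already collected the text of the pagination links from the first page.

```python
from typing import Callable, Iterable, List

# Hypothetical URL template; substitute the real URL structure of the source.
URL_TEMPLATE = "https://example.invalid/list?page={page}"


def page_urls(template: str, last_page: int, first_page: int = 1) -> List[str]:
    """Build one URL per page, from first_page to last_page inclusive."""
    return [template.format(page=n) for n in range(first_page, last_page + 1)]


def last_page_number(link_texts: Iterable[str], default: int = 1) -> int:
    """Pick the highest page number out of pagination link texts like '1', '2', 'Next'."""
    numbers = [int(t) for t in (s.strip() for s in link_texts) if t.isdigit()]
    return max(numbers) if numbers else default


def scrape_all(urls: Iterable[str],
               fetch: Callable[[str], str],
               parse: Callable[[str], List[str]]) -> List[str]:
    """Fetch every URL and concatenate whatever parse() extracts from each page."""
    results: List[str] = []
    for url in urls:
        results.extend(parse(fetch(url)))
    return results


# Hard-coded variant, as in the post: 33 pages.
urls = page_urls(URL_TEMPLATE, 33)
print(len(urls), urls[0], urls[-1])
```

Passing fetch and parse in as arguments keeps the loop independent of how a single page is downloaded and parsed.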




Another point to note: once this list is being consumed by a SIEM or similar, the script can be modified to generate the raw feed based on the date/time stamp so that data is not duplicated. This would also require inspecting the date/time stamp <td> item.
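A simple way to avoid re-sending old entries is to remember the newest timestamp already delivered and keep only rows that are more recent. The sketch below assumes each row has been reduced to a (timestamp string, IP) pair and that the timestamp looks like "2016/04/09_09:30"; that format is a guess for illustration, so adjust TIMESTAMP_FORMAT to whatever the date/time <td> actually contains.

```python
from datetime import datetime
from typing import Iterable, List, Tuple

# Assumed timestamp layout; change this to match the source's date/time cell.
TIMESTAMP_FORMAT = "%Y/%m/%d_%H:%M"


def new_entries(rows: Iterable[Tuple[str, str]],
                last_seen: datetime) -> List[Tuple[datetime, str]]:
    """Keep only (timestamp, ip) rows strictly newer than last_seen."""
    fresh: List[Tuple[datetime, str]] = []
    for stamp, ip in rows:
        try:
            when = datetime.strptime(stamp, TIMESTAMP_FORMAT)
        except ValueError:
            continue  # skip rows whose timestamp does not match the assumed format
        if when > last_seen:
            fresh.append((when, ip))
    return fresh


rows = [
    ("2016/04/08_10:00", "203.0.113.7"),
    ("2016/04/09_09:30", "198.51.100.23"),
    ("n/a", "192.0.2.1"),
]
last_run = datetime(2016, 4, 8, 23, 59)
for when, ip in new_entries(rows, last_run):
    print(when.isoformat(), ip)
```

After each run, the newest timestamp returned would be stored and used as last_seen the next time.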

Output:


