There are a couple of articles on how to integrate Scrapy into a Django application (or vice versa?). But most of them don't cover a complete example that includes triggering spiders from Django views. Since this is a web application, that must be our main goal.
What do we need?
Before we start, it is better to specify what we want and how we want it. Check this diagram:
It shows how our app should work:
Client sends a request with a URL to crawl. (1)
Django triggers Scrapy to run a spider to crawl that URL. (2)
Django returns a response to tell the Client that crawling just started. (3)
Scrapy completes crawling and saves the extracted data into the database. (4)
Django fetches that data from the database and returns it to the Client. (5)
Looks great and simple so far.
A note on that 5th statement
Django fetches that data from the database and returns it to the Client. (5)
Neither Django nor the client knows when Scrapy completes crawling. There is a callback method in Scrapy's pipelines, close_spider, but it belongs to the Scrapy project; we can't return a response to the client from Scrapy pipelines. We use that method only to save the extracted data into the database.
Well, eventually, somewhere, we have to tell the client:
Hey! Crawling completed and I am sending you the crawled data here.
There are two possible ways to do this (please comment if you discover more):
We can either use web sockets to inform the client when crawling is completed.
Or we can send a request from the client every 2 seconds (more? or less?) to check the crawling status after we get the "crawling started" response.
The web socket solution sounds more stable and robust, but it requires a second service running separately, which means more configuration. I will skip this option for now, but I would choose web sockets for my production-level applications.
Let’s write some code
It's time to do some real work. Let's start by preparing our environment.
# GET requests are for getting the result of a specific crawling task.
# We passed these IDs to the client in the previous request. Remember?
# They were trying to survive on the client side.
# Now they are here again, thankfully. <3
# The client passes them back so we can check the status of crawling,
# and if crawling is completed, we respond with the crawled data.
# Here we check the status of the crawl that started a few seconds ago.
# If it is finished, we can query the database and return the results.
# If it is not finished, we return the current status.
# Possible statuses are: pending, running, finished.
# This is the unique_id that we created before crawling even started.
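Since only the comments of this view survived in this copy, here is a minimal sketch of what such a GET handler could look like. It assumes the python-scrapyd-api client and a hypothetical ScrapedData Django model; the view name and the 'default' project name are illustrative, not necessarily the original code:

import json

from django.http import JsonResponse
from scrapyd_api import ScrapydAPI

from main.models import ScrapedData  # hypothetical Django model

scrapyd = ScrapydAPI('http://localhost:6800')

def crawl_status(request):
    # task_id and unique_id survived on the client side and come back here.
    task_id = request.GET.get('task_id')
    unique_id = request.GET.get('unique_id')
    if not task_id or not unique_id:
        return JsonResponse({'error': 'Missing args'}, status=400)

    # Ask scrapyd for the job status: 'pending', 'running' or 'finished'.
    status = scrapyd.job_status('default', task_id)
    if status == 'finished':
        # Crawling is done; fetch the saved results by the unique_id
        # we created before crawling even started.
        item = ScrapedData.objects.get(unique_id=unique_id)
        return JsonResponse({'data': json.loads(item.data)})

    # Not finished yet; report the current status to the client.
    return JsonResponse({'status': status})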
I tried to document the code as much as I can. But the main trick is the unique_id. Normally, we save an object to the database and then get its ID. In our case, we specify its unique_id before creating it. Once crawling is completed and the client asks for the crawled data, we can build a query with that unique_id and fetch the results.
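For illustration, creating the unique_id up front and handing it to scrapyd could look like the sketch below. Again, the python-scrapyd-api client, the view name and the project name are assumptions, not the original views.py:

import uuid

from django.http import JsonResponse
from scrapyd_api import ScrapydAPI

scrapyd = ScrapydAPI('http://localhost:6800')

def start_crawl(request):
    if request.method != 'POST':
        return JsonResponse({'error': 'POST only'}, status=405)

    url = request.POST.get('url')
    # Create the unique_id BEFORE the object exists in the database.
    unique_id = str(uuid.uuid4())
    # Pass unique_id through the spider settings, so the pipeline
    # can read it back with crawler.settings.get('unique_id').
    task_id = scrapyd.schedule('default', 'icrawler',
                               settings={'unique_id': unique_id}, url=url)
    # The client keeps both IDs and polls the status view with them.
    return JsonResponse({'task_id': task_id, 'unique_id': unique_id,
                         'status': 'started'})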
That's it. Now let's start scrapyd to make sure everything is installed and configured properly. Inside the scrapy_app/ folder, run:
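$ scrapyd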
This will start scrapyd and generate some output. Scrapyd also has a very minimal and simple web console. We don't need it in production, but we can use it to watch active jobs while developing. Once you start scrapyd, go to http://127.0.0.1:6800 and see if it is working.
Configuring Our Scrapy Project
Since this post is not about the fundamentals of Scrapy, I will skip the part about modifying spiders; you can create your spider by following the official documentation. I will put my example spider here, though:
# We are going to pass these args from our Django view.
# To make everything dynamic, we need to override them inside the __init__ method.
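Only those two comments of the spider made it into this copy, so here is a minimal sketch of what icrawler.py could look like. The CSS selector and the item shape are assumptions, chosen to match the pipeline sketch later in the post:

import scrapy


class IcrawlerSpider(scrapy.Spider):
    name = 'icrawler'

    def __init__(self, *args, **kwargs):
        # We receive these args from our Django view via scrapyd,
        # so everything stays dynamic.
        super().__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('url')]

    def parse(self, response):
        # Extract all links on the page; the item pipeline collects them.
        for href in response.css('a::attr(href)').getall():
            yield {'url': response.urljoin(href)}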
This is the icrawler.py file from scrapy_app/scrapy_app/spiders. Pay attention to the __init__ method; it is important. If we want to make a method or property dynamic, we need to define it under the __init__ method, so we can pass arguments from Django and use them there.
We also need to create an Item Pipeline for our Scrapy project. A pipeline is a class that performs actions over scraped items. From the documentation:
Typical uses of item pipelines are:
cleansing HTML data
validating scraped data (checking that the items contain certain fields)
checking for duplicates (and dropping them)
storing the scraped item in a database
We will use the last one: storing the scraped item in a database. Now let's create one. Actually, there is already a file named pipelines.py inside the scrapy_project folder, and it contains an empty-but-ready pipeline. We just need to modify it a little bit:
unique_id=crawler.settings.get('unique_id'),  # this will be passed from the Django view
# And here we are saving our crawled data with Django models.
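Those two lines are all that survived of the pipeline here, so below is a sketch of how the full pipelines.py could look around them. The ScrapedData model and its data field are assumptions matching the view sketches above:

import json

from main.models import ScrapedData  # hypothetical Django model


class ScrapyAppPipeline(object):
    def __init__(self, unique_id, *args, **kwargs):
        self.unique_id = unique_id
        self.items = []

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            unique_id=crawler.settings.get('unique_id'),  # this will be passed from the Django view
        )

    def process_item(self, item, spider):
        # Collect every extracted URL while the spider runs.
        self.items.append(item['url'])
        return item

    def close_spider(self, spider):
        # And here we are saving our crawled data with Django models.
        ScrapedData.objects.create(
            unique_id=self.unique_id,
            data=json.dumps(self.items),
        )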
And as a final step, we need to enable (uncomment) this pipeline in the Scrapy settings.py:
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
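ITEM_PIPELINES = {
    # The dotted path assumes the pipeline class sketched above lives in
    # scrapy_app/pipelines.py; adjust it to your actual project layout.
    'scrapy_app.pipelines.ScrapyAppPipeline': 300,
}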
Don't forget to restart scrapyd if it is running.
This Scrapy project basically:
Crawls a website (whose URL comes from the Django view)
Extracts all URLs from that website
Puts them into a list
Saves the list to the database via Django models.
And that’s all for the back-end part. Django and Scrapy are both integrated and should be working fine.
Notes on Front-End Part
Well, this part is quite subjective. We have tons of options. Personally, I have built my front-end with React. The only part that is not subjective is the usage of setInterval. Yes, let's remember our options: web sockets, and sending requests to the server every X seconds.
To clarify the base logic, this is a simplified version of my React component:
// Send a POST request to the server when the form button is clicked.
// Django responds back with task_id and unique_id.
// We created them in the views.py file, remember?
// Update the state with the new task and unique id.
// ####################### HERE ########################
// After updating the state,
// I start executing the checkCrawlStatus method every 2 seconds.
// Check the method's body for more details.
// ####################### HERE ########################