
I ran sudo rm -rf /* on a company server

TL;DR

  • Relying on human carefulness is not reliable; people always slip up sooner or later
  • Looked into how docker volume permissions work; they seem to be inherited from the docker image
  • Wrote two scripts that alias rm to mv, to guard against slips of the hand

 

Another Friday evening good for slacking off. The sensespider system had been under test all day, a few bugs were fixed, and it looked about ready to release. The system has always been deployed in docker.. and a few days of testing had left quite a few result files in a volume on the host… They looked untidy, so I figured I'd just clean them up.

Huh? Why is the owner root… Fine, I'll just sudo it, no big deal.

And then my hand slipped… I typed sudo rm -rf /*   …

 

It complained that /boot could not be removed: device is busy…

That gave me a scare, and I hit Ctrl-C on reflex…

Tried to ssh into the server again from my local machine and found I could no longer log in… the error said sh could not be found.

Took a look: sure enough, the server's /bin directory had been wiped clean…

Googled some posts about recovering files deleted by rm…

Tried to install some tools with sudo apt-get install…

By that point it was already telling me apt-get could not be found…

Full-on panic. Spent 3 minutes contemplating my life up to that point.

Noticed that scp was still there, so while that terminal session was still open I hurried to copy my local /bin directory up to the server.

Tried it: the ssh command worked again. So at least the rest of the repair (without bothering the IT colleagues) wouldn't require a trip to the machine room. Calmed down a little.

Then found that apt-get still didn't work… Thought about it for a minute…

Then realized the server runs CentOS…

Tried all the other common commands, and all the docker-related ones; everything worked normally.

Still, I was scared silly… It took a night's sleep to pull myself together.

Looked into the docker volume permission issue again and found that mounted directories inheriting the owner from the docker image is considered a feature: "Volumes files have root owner when running docker with non-root user." So there seems to be no way around it.

I also wrote two scripts to guard against future slips of the hand, one for zsh and one for bash.

kkrm
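The actual scripts are in the kkrm repo linked above; the core idea is just to shadow rm with a function that moves its targets into a trash directory instead of deleting them. A minimal sketch (trash location and layout are arbitrary) that works in both bash and zsh:

```sh
# in ~/.bashrc or ~/.zshrc: shadow rm with a "move to trash" function
TRASH_DIR="$HOME/.trash"

rm() {
    local day_dir="$TRASH_DIR/$(date +%Y%m%d)"
    mkdir -p "$day_dir"
    local f
    for f in "$@"; do
        case "$f" in
            -*) continue ;;          # ignore option flags such as -r / -f
        esac
        mv -- "$f" "$day_dir/"       # move instead of delete
    done
}
```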

 

 

docker network conflicts with the local network's subnet

Background:

The company's crawler server deployed in HK suddenly seemed to go down. It later turned out it was only unreachable from the Shenzhen office. After some digging, the cause was that docker's network (either the subnet of a docker network or the IP of some container) was the same as the host's internal IP range, which caused the conflict.

Troubleshooting:

There are two things to check. The first is the default network created when the docker daemon starts.

The default network uses bridge mode, which is the default way for containers to communicate with the host.

To change the default subnet, see http://blog.51cto.com/wsxxsl/2060761

Besides that, you also need to watch the subnets of the networks created by docker.

Use docker network ls to list the current networks,

then use docker inspect to see the details of each network.

You can also simply run ip addr and check whether any of the odd-looking virtual interfaces has an IP whose leading octets match the host's address.
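Concretely, those checks boil down to something like this (the network name is a placeholder):

```sh
$ docker network ls                # list all docker networks
$ docker inspect <network-name>    # check the Subnet under IPAM.Config for each one
$ ip addr                          # any virtual interface overlapping the host's LAN prefix?
```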

Solution:

My first thought was to specify the default network's subnet when running docker-compose up,

but it turns out that doesn't seem to be supported? See "version 1.10.0 error on gateway spec":

Was there any discussion on that? I do need to customize the network, because my company uses the 172.16.0.0/16 address range at some segments and Docker will simply clash with that by default, so every single Docker server in the whole company needs a forced network setting.

Now while upgrading my dev environment to Docker 1.13 it took me hours to stumble into this Github issue, because the removal of those options was completely undocumented.

So please, if I am working on a network which requires a custom docker subnet, how am I supposed to use Docker Compose and Docker Swarm?

In the end I used a fairly indirect workaround:

first create a docker network by hand, and then reference it in the docker-compose file.
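A minimal sketch of that workaround; the network name, subnet and service are placeholders rather than the real config:

```yaml
# first: docker network create --subnet=192.168.200.0/24 crawler_net
version: "3"
services:
  spider:
    image: my-spider:latest
    networks:
      - crawler_net

networks:
  crawler_net:
    external: true    # use the pre-created network instead of letting compose pick a subnet
```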

 

 

 

 

 

 

Notes on a pitfall with volumes in docker compose

Symptom:

Mounting a named volume with docker compose has no effect (and there is no error message).

Troubleshooting:

At first I wasn't using the docker-compose command at all; running docker run -v directly to mount two absolute paths worked fine.

Then I switched to a named volume, using the local-persist plugin to control where the volume lives on the host. Still with docker run -v, still no problem.

Next I moved it into docker compose, and found the volume wasn't actually mounted.

Mounting two absolute paths from docker compose, however, was fine.

So I suspected the named volume itself.

At that point I ran docker inspect on the container that docker compose had started with the named volume,

and found that in its Mounts section, the named volume was not the name I had written in docker-compose.yml: it had an extra prefix, which happened to be the name of the directory containing docker-compose.yml.

A quick search showed I was definitely not the only one bitten by this orz: "Docker-compose prepends directory name to named volumes"

In hindsight I should have gone straight to docker inspect… it would have found the problem much faster.

Solution:

There are a few ways to fix it:

  • Don't create the volume by hand; instead define the volume's mountpoint in docker-compose.yml via the volume driver options
  • Or add external: true to the volume entry in docker-compose.yml; see the docs on external

By the way, here is the shape of my docker-compose.yml.
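What follows is only a sketch of the external-volume variant rather than the original file (service, image and volume names are placeholders); the volume is created beforehand with the local-persist plugin so its location on the host is fixed:

```yaml
# first: docker volume create -d local-persist -o mountpoint=/data/spider_result --name=spider_result
version: "2"
services:
  spider:
    image: my-spider:latest
    volumes:
      - spider_result:/app/result

volumes:
  spider_result:
    external: true    # compose does not prepend the directory name to an external volume
```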

 

 

 

 

How to use Scrapy with Django Application (reposted from Medium)

I came across a really good article on Medium… but annoyingly, the key parts would never load… not even over a proxy, which was pretty frustrating… Just now it suddenly loaded, so I'm copying it over here in case it becomes unreachable again.

There are a couple of articles on how to integrate Scrapy into a Django application (or vice versa?). But most of them don't cover a full, complete example that includes triggering spiders from Django views. Since this is a web application, that must be our main goal.

What do we need?

Before we start, it is better to specify what we want and how we want it. Check this diagram:

It shows how our app should work:

  • Client sends a request with a URL to crawl it. (1)
  • Django triggers Scrapy to run a spider to crawl that URL. (2)
  • Django returns a response to tell the client that crawling has just started. (3)
  • Scrapy completes crawling and saves the extracted data into the database. (4)
  • Django fetches that data from the database and returns it to the client. (5)

Looks great and simple so far.

A note on that 5th statement

Django fetches that data from the database and returns it to the client. (5)

Neither Django nor the client knows when Scrapy completes crawling. There is a callback method named close_spider, but it belongs to the Scrapy project. We can't return a response from Scrapy pipelines. We use that method only to save the extracted data into the database.

 

Well, eventually, somewhere, we have to tell the client:

Hey! Crawling is completed and I am sending you the crawled data here.

There are two possible ways to do this (please comment if you discover more):

We can either use web sockets to inform the client when crawling is completed.

Or,

We can start sending requests every 2 seconds (more? less?) from the client to check the crawling status after we get the "crawling started" response.

The web socket solution sounds more stable and robust. But it requires a second service running separately, which means more configuration. I will skip this option for now. But I would choose web sockets for my production-level applications.

Let’s write some code

It's time to do some real work. Let's start by preparing our environment.

Installing Dependencies

Create a virtual environment and activate it:
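Roughly, that means something like this (the package list is inferred from the rest of the article):

```sh
$ python3.5 -m venv venv
$ source venv/bin/activate
(venv) $ pip install django scrapy scrapyd python-scrapyd-api
```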

Scrapyd is a daemon service for running Scrapy spiders. You can discover its details from here.

python-scrapyd-api is a wrapper that allows us to talk to scrapyd from our Python program.

Note: I am going to use Python 3.5 for this project

Creating Django Project

Create a Django project with an app named main:
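Something along these lines (the project name django_project is an assumption, not from the original):

```sh
$ django-admin startproject django_project
$ cd django_project
$ python manage.py startapp main
```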

We also need a model to save our scraped data. Let’s keep it simple:
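A minimal sketch of such a model; the class and field names here are my own choices, with the crawled URL list stored as a JSON string in a plain TextField:

```python
# main/models.py
from django.db import models


class ScrapyItem(models.Model):
    unique_id = models.CharField(max_length=100, null=True)
    data = models.TextField()                      # crawled URLs, serialized as JSON
    date = models.DateTimeField(auto_now_add=True)
```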

Add the main app to INSTALLED_APPS in settings.py. And as a final step, run the migrations:
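That is the usual pair of commands:

```sh
$ python manage.py makemigrations main
$ python manage.py migrate
```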

Let’s add a view and url to our  main app:
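A sketch of what such a view can look like, assuming the python-scrapyd-api client and the ScrapyItem model sketched above; 'default' is the project name scrapyd exposes when it is started inside the Scrapy project folder, as done further down:

```python
# main/views.py
import uuid

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_http_methods
from scrapyd_api import ScrapydAPI

from .models import ScrapyItem

scrapyd = ScrapydAPI('http://localhost:6800')


@csrf_exempt
@require_http_methods(['POST', 'GET'])
def crawl(request):
    if request.method == 'POST':
        url = request.POST.get('url')
        if not url:
            return JsonResponse({'error': 'Missing url'}, status=400)

        # We choose the unique_id ourselves, before anything is saved,
        # so the client can poll with it later.
        unique_id = str(uuid.uuid4())

        # Ask scrapyd to run the spider, forwarding url and unique_id as spider arguments.
        task_id = scrapyd.schedule('default', 'icrawler',
                                   url=url, unique_id=unique_id)
        return JsonResponse({'task_id': task_id,
                             'unique_id': unique_id,
                             'status': 'started'})

    # GET: the client polls with the unique_id until the crawled data is in the database.
    unique_id = request.GET.get('unique_id')
    try:
        item = ScrapyItem.objects.get(unique_id=unique_id)
        return JsonResponse({'data': item.data, 'status': 'done'})
    except ScrapyItem.DoesNotExist:
        return JsonResponse({'status': 'pending'})
```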

I tried to document the code as much as I can.

But the main trick is unique_id. Normally, we save an object to the database and then get its ID. In our case, we specify its unique_id before creating it. Once crawling is completed and the client asks for the crawled data, we can query with that unique_id and fetch the results.

And a URL for this view:
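For example (assuming Django ≥ 2.0; on 1.x the equivalent is url() from django.conf.urls, and django_project is the assumed project name):

```python
# django_project/urls.py
from django.contrib import admin
from django.urls import path

from main import views

urlpatterns = [
    path('admin/', admin.site.urls),
    path('api/crawl/', views.crawl, name='crawl'),
]
```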

Creating Scrapy Project

It is better to create the Scrapy project under (or next to) our Django project. This makes it easier to connect them together. So let's create it under the Django project folder:
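That is a single command, run from the Django project root:

```sh
$ scrapy startproject scrapy_app
```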

Now we need to create our first spider from inside the scrapy_app folder:
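Roughly like this; example.com is only a placeholder domain, since the real URL is passed in at runtime from the Django view:

```sh
$ cd scrapy_app
$ scrapy genspider -t crawl icrawler example.com
```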

I named the spider icrawler. You can name it anything. Note the -t crawl part: we are specifying a base template for our spider. You can see all available templates with:
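That is:

```sh
$ scrapy genspider -l
```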

Now we should have a folder structure like this:
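Roughly like this (django_project again being the assumed top-level name):

```
django_project/
├── django_project/
│   └── settings.py
├── main/
│   ├── models.py
│   ├── views.py
│   └── ...
├── manage.py
└── scrapy_app/
    ├── scrapy.cfg
    └── scrapy_app/
        ├── settings.py
        ├── pipelines.py
        ├── items.py
        └── spiders/
            └── icrawler.py
```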

Connecting Scrapy to Django

In order to access Django models from Scrapy, we need to connect the two projects together. Go to the settings.py file under scrapy_app/scrapy_app/ and put:
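A sketch of the usual wiring, appended at the end of scrapy_app/scrapy_app/settings.py; the Django settings module name django_project.settings is an assumption:

```python
# scrapy_app/scrapy_app/settings.py (at the end of the file)
import os
import sys

import django

# Make the Django project importable: two levels up from this settings file.
DJANGO_PROJECT_PATH = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', '..'))
sys.path.append(DJANGO_PROJECT_PATH)

os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'django_project.settings')
django.setup()
```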

That's it. Now let's start scrapyd to make sure everything is installed and configured properly. Inside the scrapy_app/ folder run:

$ scrapyd

This will start scrapyd and generate some output. Scrapyd also has a very minimal and simple web console. We don't need it in production, but we can use it to watch active jobs while developing. Once you start scrapyd, go to http://127.0.0.1:6800 and see if it is working.

Configuring Our Scrapy Project

Since this post is not about the fundamentals of Scrapy, I will skip the part about modifying spiders. You can create your spider with the official documentation. I will put my example spider here, though:
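What follows is a sketch of such a spider rather than the original file; the crucial part is that url and unique_id arrive as spider arguments from the Django view via scrapyd:

```python
# scrapy_app/scrapy_app/spiders/icrawler.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class IcrawlerSpider(CrawlSpider):
    name = 'icrawler'

    def __init__(self, *args, **kwargs):
        # url and unique_id are passed by scrapyd.schedule(...) in the Django view,
        # so everything that depends on them has to be set up here.
        self.url = kwargs.get('url')
        self.unique_id = kwargs.get('unique_id')
        self.start_urls = [self.url]
        # crude domain extraction, good enough for a sketch
        self.allowed_domains = [self.url.split('//')[-1].split('/')[0]]
        self.rules = [
            # follow=True keeps crawling links within the allowed domain
            Rule(LinkExtractor(unique=True), callback='parse_item', follow=True),
        ]
        super(IcrawlerSpider, self).__init__(*args, **kwargs)

    def parse_item(self, response):
        # Yield every link found on the page; the pipeline collects them into one list.
        for link in LinkExtractor(unique=True).extract_links(response):
            yield {'url': link.url}
```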

Above is the icrawler.py file from scrapy_app/scrapy_app/spiders. Pay attention to the __init__ method; it is important. If we want to make a method or property dynamic, we need to define it inside __init__, so we can pass arguments from Django and use them here.

We also need to create an Item Pipeline for our Scrapy project. A pipeline is a class for performing actions over scraped items. From the documentation:

Typical uses of item pipelines are:

  • cleansing HTML data
  • validating scraped data (checking that the items contain certain fields)
  • checking for duplicates (and dropping them)
  • storing the scraped item in a database

Yay! Storing the scraped item in a database. Now let's create one. Actually, there is already a file named pipelines.py inside the scrapy_app folder, and it contains an empty-but-ready pipeline. We just need to modify it a little bit:
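A sketch of that pipeline, matched to the ScrapyItem model from the Django side (the class name is whatever your generated pipelines.py already contains):

```python
# scrapy_app/scrapy_app/pipelines.py
import json

from main.models import ScrapyItem   # importable because django.setup() ran in settings.py


class ScrapyAppPipeline(object):
    def __init__(self):
        self.urls = []

    def process_item(self, item, spider):
        # Just collect; nothing is written until the crawl is over.
        self.urls.append(item['url'])
        return item

    def close_spider(self, spider):
        # Crawling finished: persist everything under the spider's unique_id,
        # so the Django view can find it when the client polls.
        ScrapyItem.objects.create(unique_id=spider.unique_id,
                                  data=json.dumps(self.urls))
```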

And as a final step, we need to enable (uncomment) this pipeline in the Scrapy settings.py file:
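That is the ITEM_PIPELINES setting:

```python
# scrapy_app/scrapy_app/settings.py
ITEM_PIPELINES = {
    'scrapy_app.pipelines.ScrapyAppPipeline': 300,
}
```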

Don't forget to restart scrapyd if it is running.

This Scrapy project basically:

  • Crawls a website (whose URL comes from the Django view)
  • Extracts all URLs from the website
  • Puts them into a list
  • Saves the list to the database via Django models.

And that’s all for the back-end part. Django and Scrapy are both integrated and should be working fine.

Notes on Front-End Part

Well, this part is quite subjective. We have tons of options. Personally I built my front-end with React. The only part that is not subjective is the usage of setInterval. Yes, let's remember our options: web sockets, or sending requests to the server every X seconds.

To clarify the base logic, this is a simplified version of my React component:
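A simplified sketch of such a component; the endpoint paths match the Django view sketched earlier, while everything else (component name, props) is illustrative:

```jsx
// Crawler.js -- start a crawl, then poll every 2 seconds until the data is ready
import React from 'react';

class Crawler extends React.Component {
  state = { status: 'idle', data: null, uniqueId: null };

  startCrawl = () => {
    const body = new FormData();
    body.append('url', this.props.url);
    fetch('/api/crawl/', { method: 'POST', body })
      .then(res => res.json())
      .then(json => {
        this.setState({ status: 'started', uniqueId: json.unique_id });
        this.interval = setInterval(this.checkStatus, 2000); // poll every 2 seconds
      });
  };

  checkStatus = () => {
    fetch(`/api/crawl/?unique_id=${this.state.uniqueId}`)
      .then(res => res.json())
      .then(json => {
        if (json.status === 'done') {
          clearInterval(this.interval);
          this.setState({ status: 'done', data: JSON.parse(json.data) });
        }
      });
  };

  componentWillUnmount() {
    clearInterval(this.interval); // stop polling if the component goes away
  }

  render() {
    const { status, data } = this.state;
    return (
      <div>
        <button onClick={this.startCrawl}>Start crawling</button>
        <p>Status: {status}</p>
        {data && <ul>{data.map(url => <li key={url}>{url}</li>)}</ul>}
      </div>
    );
  }
}

export default Crawler;
```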


 

 

You can discover the details through the comments I added. It is quite simple, actually.

Oh, that's it. It took longer than I expected. Please leave a comment with any kind of feedback.