Fetching Zhihu Answers and Converting Them to Markdown Files

Python | 四毛 | 8870 views | 38 comments

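20170609 update:

Thanks to 一介草民 and ftzz for their feedback

(1) Fixed saving to paths that contain Chinese characters

(2) Fixed the offset issue

(3) Fixed the first issue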

Here is something fun

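20170607 update:

(1) Thanks to Ftzz for the reminder: images are now replaced with the original full-size versions

(2) Files are now saved locally, which fixes the biggest drawback: they can be read without a network connection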

 

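Hi everyone, I'm 四毛.

A word before we start

Before we begin, I want to share something I came across in a crawler script while browsing GitHub a while ago:

[screenshot omitted]

So when you crawl a website, it pays to be considerate. Crawl gently and appreciate what you have.

OK, back to the topic.

Today's post is about converting all the answers to a given Zhihu question into local Markdown files.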

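Prerequisites

python2.7

html2text

markdownpad (any editor that supports md will do)

Knowing how to capture network traffic

Most importantly, you need proxies, because Zhihu has started banning IPs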

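1. What is a Markdown file

Markdown is a lightweight markup language for writing. It replaces typesetting with a concise syntax, unlike the word processors we normally use, such as Word or Pages, with their many layout and font settings. It lets us concentrate on the words and use markup syntax in place of the usual formatting. This post, for example, from content to formatting and even the illustrations, can be produced entirely from the keyboard.

Well, I copied the paragraph above, haha. If you want to know more, have a look here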

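2. Why convert answers to Markdown

Because... I'm lazy, haha, just kidding. The main reason is that Markdown is pleasant to read. When writing scripts I keep coming back to one question: how do you save a web page that mixes text and images in its original form? With tools there are plenty of options: press CTRL+P and choose Save as PDF when printing, or use Evernote (印象笔记) to save the whole page. But how do we do it with a crawler? A few days ago I came across this project, studied it closely, and found it very inspiring.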

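3. How it works

The idea is simple: take the BODY part of the response, rebuild a complete HTML file around it, use the html2text module to convert that file to Markdown, and finally post-process the images and the title so they follow Markdown conventions. For now the main use case is Zhihu.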
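The steps above can be sketched roughly as follows. This is a minimal sketch in Python 3 (the original scripts targeted Python 2.7), not the code from this post: the names `wrap_fragment` and `to_markdown` and the HTML template are made up for illustration, and the conversion step assumes the third-party `html2text` package is installed.

```python
from html import escape

try:
    import html2text  # third-party: pip install html2text
except ImportError:  # the wrapping step still works without it
    html2text = None

# Hypothetical template; any complete HTML skeleton would do.
HTML_TEMPLATE = """<!DOCTYPE html>
<html>
<head><meta charset="utf-8"><title>{title}</title></head>
<body>
{body}
</body>
</html>"""


def wrap_fragment(title, body_fragment):
    """Rebuild a complete HTML document around an answer's body fragment."""
    return HTML_TEMPLATE.format(title=escape(title), body=body_fragment)


def to_markdown(title, body_fragment):
    """Convert the rebuilt HTML to Markdown and put the title on top."""
    if html2text is None:
        raise RuntimeError("html2text is not installed")
    markdown = html2text.html2text(wrap_fragment(title, body_fragment))
    return "# {0}\n\n{1}".format(title, markdown)
```

Calling `to_markdown("Some question", "<p>answer body</p>")` would then return Markdown text that can be written to a `.md` file. Image links and other details still need the post-processing mentioned above.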

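4. Show Code

    4.1 Fetching Zhihu answers

When writing the code I had two use cases in mind. First, fetch the data for one specific answer and convert it. Second, fetch all the answers to a question and convert them one by one; here the upvote count can be used to control the quality of the answers fetched.

    4.1.1 Fetching one specific answer

url: https://www.zhihu.com/question/27621722/answer/48658220 (the first number is the question ID, the second is the answer ID)

I split this fetch into two parts. The first part requests the URL above and extracts the answer body and the upvote count. The second part requests the following API:

https://www.zhihu.com/api/v4/answers/48658220

Why? Because the answer body returned by this API is not complete, so it has to be done in two steps.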
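As a small illustration of the two requests, the sketch below derives both URLs from an answer link. The helper names `parse_answer_url` and `single_answer_urls` are hypothetical; only the two URL patterns come from the text above. No request is sent here.

```python
import re

ANSWER_PAGE_URL = "https://www.zhihu.com/question/{question_id}/answer/{answer_id}"
ANSWER_API_URL = "https://www.zhihu.com/api/v4/answers/{answer_id}"

ANSWER_LINK = re.compile(r"/question/(\d+)/answer/(\d+)")


def parse_answer_url(url):
    """Return (question_id, answer_id) as strings, taken from an answer link."""
    match = ANSWER_LINK.search(url)
    if match is None:
        raise ValueError("not an answer URL: {0}".format(url))
    return match.group(1), match.group(2)


def single_answer_urls(question_id, answer_id):
    """Build the page URL (body, upvotes) and the API URL (second request)."""
    return {
        "page": ANSWER_PAGE_URL.format(question_id=question_id, answer_id=answer_id),
        "api": ANSWER_API_URL.format(answer_id=answer_id),
    }
```

For example, `single_answer_urls(*parse_answer_url(link))` gives the pair of URLs to request for a given answer `link`.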

    4.1.2 Fetching all answers to a question

This data is easy to get. The API is:

https://www.zhihu.com/api/v4/questions/27621722/answers?sort_by=default&include=data%5B%2A%5D.is_normal%2Cis_collapsed%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Cmark_infos%2Ccreated_time%2Cupdated_time%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cupvoted_followees%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset=3

Everything comes back as JSON, which is easy to work with. One thing to note: the answer body taken from this response is just text, not a complete HTML file, so a complete document has to be constructed around it.
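Paging through this API and filtering by upvotes might look like the sketch below. Several things here are assumptions rather than facts from this post: that a trimmed `include` value is accepted, that answers sit in a `data` list, and that the response carries a `paging.is_end` flag. The function names are hypothetical, and the code only builds URLs and inspects an already decoded JSON payload.

```python
from urllib.parse import urlencode

ANSWERS_API = "https://www.zhihu.com/api/v4/questions/{question_id}/answers"

# Assumption: a shortened include list is enough for body, votes and time.
DEFAULT_INCLUDE = "data[*].content,voteup_count,created_time"


def answers_url(question_id, offset=0, limit=20, include=DEFAULT_INCLUDE):
    """Build the URL for one page of answers, starting at `offset`."""
    query = urlencode({
        "sort_by": "default",
        "include": include,
        "limit": limit,
        "offset": offset,
    })
    return ANSWERS_API.format(question_id=question_id) + "?" + query


def select_answers(payload, min_votes=0):
    """Return (answers with enough upvotes, whether more pages remain)."""
    answers = [
        answer for answer in payload.get("data", [])
        if answer.get("voteup_count", 0) >= min_votes
    ]
    # Assumption: the response reports the last page via paging.is_end.
    is_end = payload.get("paging", {}).get("is_end", True)
    return answers, not is_end
```

A crawl loop would request `answers_url(question_id, offset)`, pass the decoded JSON to `select_answers`, and increase `offset` by `limit` while the second return value is true.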

    4.1.3 Fields saved

author_name  name of the answering user
answer_id  answer ID
question_id  question ID
question_title  question title
vote_up_count  upvote count
create_time  creation time

plus the answer body
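One way to carry these fields around is a small record type, sketched below. The class name `AnswerRecord`, the `content` field name for the answer body, the header layout, and the assumption that `create_time` is a Unix timestamp are all illustrative choices, not taken from the original scripts.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class AnswerRecord:
    author_name: str
    answer_id: int
    question_id: int
    question_title: str
    vote_up_count: int
    create_time: int  # assumed to be a Unix timestamp in seconds
    content: str      # answer body (HTML fragment); field name is hypothetical

    def header(self):
        """Markdown lines to place above the converted answer body."""
        created = datetime.fromtimestamp(self.create_time, tz=timezone.utc)
        source = "https://www.zhihu.com/question/{0}/answer/{1}".format(
            self.question_id, self.answer_id)
        return "\n".join([
            "# {0}".format(self.question_title),
            "",
            "Author: {0} | Upvotes: {1} | Created: {2}".format(
                self.author_name, self.vote_up_count,
                created.strftime("%Y-%m-%d")),
            "Source: {0}".format(source),
        ])
```

Putting the metadata at the top of each Markdown file keeps the link back to the original answer even after the file is moved around.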

 

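    4.2 Code

Main script: zhihu.py

zhihu.py is the main script. It is simple: send the request, call the parsing functions, then save the result.

Parsing script: parse_content.py

parse_content.py is mainly responsible for constructing the new HTML and then parsing it to extract the data.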

5. Test results

Well, there is more below, but I won't screenshot all of it.

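6. Drawbacks and limitations

Now for the drawbacks of this approach.

The biggest drawback is:

You must be online!

You must be online!

You must be online!

Because... the md file only contains the URL of each image. That means the Markdown editor fetches the image from the server that hosts it, so once you are offline you can no longer see the images. It also means that if the user deletes an image, you lose it too.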
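One way around this, in line with the 20170607 update that saves files locally, is to rewrite each image link to a local path and download the files separately. The sketch below only does the rewriting and returns the list of downloads to perform. The function name `localize_images` and the `images` directory are hypothetical, and it assumes each image URL ends in a file name.

```python
import os
import re
from urllib.parse import urlparse

# Matches Markdown image links of the form ![alt](url)
IMAGE_LINK = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def localize_images(markdown, image_dir="images"):
    """Point image links at local files and list what has to be downloaded."""
    downloads = []

    def swap(match):
        alt, url = match.group(1), match.group(2)
        filename = os.path.basename(urlparse(url).path)
        local_path = "{0}/{1}".format(image_dir, filename)
        downloads.append((url, local_path))
        return "![{0}]({1})".format(alt, local_path)

    return IMAGE_LINK.sub(swap, markdown), downloads
```

Each `(url, local_path)` pair in the returned list can then be fetched with whatever HTTP client and proxy setup the crawler already uses, after which the Markdown file no longer depends on the image server.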

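Later, though, I found that when a file is exported to HTML from markdownpad, all of the content, images included, is still visible even offline. So if you really like an answer, saving it to Evernote is a good choice, and saving straight to PDF works too. If you use the method in this post, it is best to convert the result to HTML.

Another drawback is that the output of html2text is not especially good, so it still needs some post-processing.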

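7. Summary

There is still a lot in the code that could be improved. Feel free to get in touch: QQ: 549411552 (mention that you came from 静觅)

As usual: the code is here

That's a wrap.

Please credit when reposting: 静觅 » Fetching Zhihu Answers and Converting Them to Markdown Files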
