A blocked state is one in which a program is suspended because it has not yet obtained the computing resources it needs. If a program cannot do anything else while it waits for some operation to complete, we say the program is blocked on that operation. Common forms of blocking include network I/O blocking, disk I/O blocking, and blocking on user input. Blocking is everywhere: even during a CPU context switch, none of the processes can make real progress, so they too are blocked. On a multi-core CPU, the core performing the context switch is likewise unavailable.
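To make this concrete, here is a minimal sketch (names and timings are illustrative, not from the original) of a program blocked on a slow operation:

import time

def fetch_blocking():
    # Stand-in for slow network I/O: the whole thread is stuck here
    # and can do nothing else until the call returns.
    time.sleep(2)
    return 'result'

start = time.time()
results = [fetch_blocking() for _ in range(3)]
print('Cost time:', time.time() - start)  # about 6 seconds: the calls run strictly one after another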
Coroutine: <coroutine object execute at 0x10e0f7830>
After calling execute
Task: <Task pending coro=<execute() running at demo.py:4>>
Number: 1
Task: <Task finished coro=<execute() done, defined at demo.py:4> result=1>
After calling loop
Coroutine: <coroutine object execute at 0x10aa33830>
After calling execute
Task: <Task pending coro=<execute() running at demo.py:4>>
Number: 1
Task: <Task finished coro=<execute() done, defined at demo.py:4> result=1>
After calling loop
We can see that the effect is exactly the same in both cases.
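The demo.py script itself is not included in this excerpt; based on the printed lines above, a reconstruction that produces output of this shape (using the loop.create_task variant; the asyncio.ensure_future variant behaves the same) would be roughly:

import asyncio

async def execute(x):
    print('Number:', x)
    return x

coroutine = execute(1)
print('Coroutine:', coroutine)
print('After calling execute')

loop = asyncio.get_event_loop()
task = loop.create_task(coroutine)
print('Task:', task)
loop.run_until_complete(task)
print('Task:', task)
print('After calling loop')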
3.2 Binding a Callback
We can also bind a callback method to a task. Consider the following example:
import asyncio
import requests
async def request():
    url = 'https://www.baidu.com'
    status = requests.get(url)
    return status
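The snippet above only defines the coroutine; the callback-binding part is not shown in this excerpt. A minimal sketch of attaching one with add_done_callback() (the callback function here is illustrative):

def callback(task):
    # Invoked automatically once the task finishes;
    # task.result() is the return value of request().
    print('Status:', task.result())

coroutine = request()
task = asyncio.ensure_future(coroutine)
task.add_done_callback(callback)

loop = asyncio.get_event_loop()
loop.run_until_complete(task)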
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Cost time: 15.049368143081665
Waiting for http://127.0.0.1:5000
Waiting for http://127.0.0.1:5000
Waiting for http://127.0.0.1:5000
Waiting for http://127.0.0.1:5000
Waiting for http://127.0.0.1:5000
Cost time: 15.048935890197754
Task exception was never retrieved
future: <Task finished coro=<request() done, defined at demo.py:7> exception=TypeError("object Response can't be used in 'await' expression",)>
Traceback (most recent call last):
  File "demo.py", line 10, in request
    status = await requests.get(url)
TypeError: object Response can't be used in 'await' expression
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Cost time: 15.134317874908447
Waiting for http://127.0.0.1:5000
Waiting for http://127.0.0.1:5000
Waiting for http://127.0.0.1:5000
Waiting for http://127.0.0.1:5000
Waiting for http://127.0.0.1:5000
Get response from http://127.0.0.1:5000 Result: Hello!
Get response from http://127.0.0.1:5000 Result: Hello!
Get response from http://127.0.0.1:5000 Result: Hello!
Get response from http://127.0.0.1:5000 Result: Hello!
Get response from http://127.0.0.1:5000 Result: Hello!
Cost time: 3.0199508666992188
I won't enumerate everyday tools one by one here; most Mac users will install them anyway, e.g. QQ, WeChat, Chrome, NetEase Cloud Music, Thunder (Xunlei), and so on. Almost all of these are must-have software on Windows too, so I won't go into them further.
Productivity Tools
As the name suggests, productivity tools simplify working with the Mac and raise your working efficiency. Below are a few I use regularly.
Alfred
Alfred comes first; it is arguably must-have Mac software. With it we can perform all kinds of actions quickly and boost productivity substantially: launch an application, open a link, search a document, locate a file, look up the machine's IP, define a color value, and almost anything else you can think of. How are these shortcuts implemented? Alfred hooks into Workflows, and we can use Workflows to extend its functionality easily. Collections of excellent Workflows have already been curated; see https://github.com/zenorocha/alfred-workflows. Rating: ★★★★★
Todoist
You are probably using one of the many to-do list apps out there. Having tried quite a few myself, I find Todoist the most convenient. It supports all kinds of task customization, plus groups, priorities, deadlines, assignees, reminders, collaboration, and productivity statistics. Its platform coverage is also unusually complete: web, desktop, and mobile of course, but also a browser extension, an email integration, and wearable versions (Apple Watch, Google Wear). It can also sync with the Mac Calendar, so events added to the calendar automatically appear in Todoist, which is very convenient. It is the best one I have tried so far. I recommend the paid premium plan to unlock all features; at 3 USD per month I find it well worth it. Rating: ★★★★☆
Paste
By default the Mac has only a single clipboard: once you copy a new piece of text, anything you copied earlier is gone, which is rather user-hostile. Fortunately Paste solves this problem. It keeps a history of your clipboard, and when you need to paste something you simply bring up Paste's history panel and pick the entry you want. It also supports adjusting text formatting, organizing and searching clipboard entries, and quick pasting. With it, you will never worry about losing your clipboard again! Rating: ★★★★★
Synergy
At work I use a company desktop running Windows, with my personal MacBook sitting next to it, and I switch between the two machines. I can hardly set up two keyboards and two mice; that would be clumsy and there isn't enough desk space anyway. With Synergy, the two PCs can be linked so that the keyboard and mouse are shared. One keyboard and one mouse operate both machines; note that these are two fully independent PCs, each with its own screen and operating system. The mouse pointer slides straight from one machine's screen onto the other's, and the keyboard and clipboard are shared as well. Imagine this scenario: on the Windows desktop I open a page that asks for a long serial number, and that serial number happens to be stored on the Mac. With Synergy linking the two, I just slide the mouse from the Windows screen over to the Mac's, select the serial number, press the copy shortcut, slide the mouse back to Windows, and paste, all in one smooth motion, instead of having to message the text across. A big efficiency win. Rating: ★★★★
Tuxera NTFS for Mac
When using an external hard drive with a Mac, you may hit a problem where data cannot be transferred (e.g. files cannot be copied). This is because many external drives are formatted as NTFS, a format the Mac's disks do not use, so files cannot be copied between the two. One solution is Tuxera NTFS for Mac; with it installed, copying files works smoothly. There are NTFS-for-Mac products from other vendors as well, which are also worth trying. Rating: ★★★★☆
VMware & Parallels Desktop
After switching to a Mac, there will inevitably be situations where you still have to use Windows, since much software only ships a Windows version. I don't recommend dual-booting on a Mac, though; just install a virtual machine. Two VM products work well on the Mac: the famous VMware and Parallels Desktop. I have used both and found both excellent; I currently use VMware. Rating: ★★★★☆
CleanMyMac
Sooner or later your disk fills up. If your Mac has a 512 GB drive you are probably fine; with 256 GB you need to keep an eye on it (owners of custom 1 TB machines can skip this one). CleanMyMac conveniently scans caches, large files, the Trash, leftovers, and so on; cleaning these up frees a lot of disk space. It also offers app uninstallation and leftover-removal features that remove software from the Mac very cleanly. It should be on version 3 by now. Highly recommended. Rating: ★★★★☆
Editors
Since we do program development, we naturally need a properly configured development environment. Below are the development tools I use day to day.
JetBrains
My current IDEs are the JetBrains suite. I mostly write Python, so I mainly use PyCharm; for front-end work I use WebStorm, IntelliJ IDEA for Java, CLion for C and C++, PhpStorm for PHP, and RubyMine for Ruby. I rarely use other languages, so I haven't installed the rest. Some of you will point out that JetBrains IDEs must be purchased. Various workarounds, such as License Server activation, circulate online and are easy to find. More importantly, JetBrains runs dedicated Educational Programs, which you can apply to at https://www.jetbrains.com/education/programs/?fromMenu: students, teachers, and educators can obtain a free license with a school edu email address. If you are still a student the application is very easy; I am a student myself and currently use the student plan, and if you have already graduated you could always ask a younger sibling who is still in school. Overall I like the JetBrains suite a lot, both its UI style and its development workflow, and I recommend it. Rating: ★★★★☆
Homebrew
For developers, this is practically mandatory software on the Mac. Its official tagline is "The missing package manager for macOS": it is a package platform for the Mac containing a huge number of development packages, such as Python, PHP, Redis, MySQL, RabbitMQ, HBase, and so on; almost any development software you can think of is available through it. A real gem! Installation is very simple; see https://brew.sh/. There is also a graphical front end for Homebrew called Cakebrew, which you can use instead if you don't like the command line. Rating: ★★★★★
import gensim
import jieba
import numpy as np
from scipy.linalg import norm
model_file = './word2vec/news_12g_baidubaike_20g_novel_90g_embedding_64.bin'
model = gensim.models.KeyedVectors.load_word2vec_format(model_file, binary=True)
def vector_similarity(s1, s2):
    def sentence_vector(s):
        words = jieba.lcut(s)
        v = np.zeros(64)
        for word in words:
            v += model[word]
        v /= len(words)
        return v

    v1, v2 = sentence_vector(s1), sentence_vector(s2)
    return np.dot(v1, v2) / (norm(v1) * norm(v2))
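Assuming the model file above loads successfully, usage might look like this (the sentences are illustrative):

s1 = '你在干嘛呢'
s2 = '你在做什么呢'
print(vector_similarity(s1, s2))  # a value close to 1 means the sentences are similar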
# If these values are missing, it means we want to use the defaults.
optional = {
    # TODO: Use custom prefixes for these settings to note that they are
    # specific to scrapy-redis.
    'queue_key': 'SCHEDULER_QUEUE_KEY',
    'queue_cls': 'SCHEDULER_QUEUE_CLASS',
    'dupefilter_key': 'SCHEDULER_DUPEFILTER_KEY',
    # We use the default setting name to keep compatibility.
    'dupefilter_cls': 'DUPEFILTER_CLASS',
    'serializer': 'SCHEDULER_SERIALIZER',
}

# Read these settings and assemble them into a dict of kwargs
# (the keys of `optional` name the attributes to populate).
for name, setting_name in optional.items():
    val = settings.get(setting_name)
    if val:
        kwargs[name] = val
# Support serializer as a path to a module.
if isinstance(kwargs.get('serializer'), six.string_types):
    kwargs['serializer'] = importlib.import_module(kwargs['serializer'])

# Obtain a Redis connection
server = connection.from_settings(settings)
# Ensure the connection is working.
server.ping()
return cls(server=server, **kwargs)
@classmethod
def from_crawler(cls, crawler):
    instance = cls.from_settings(crawler.settings)
    # FIXME: for now, stats are only supported from this constructor
    instance.stats = crawler.stats
    return instance
def open(self, spider):
    self.spider = spider
    try:
        # Instantiate a queue from the importable class path in self.queue_cls
        self.queue = load_object(self.queue_cls)(
            server=self.server,
            spider=spider,
            key=self.queue_key % {'spider': spider.name},
            serializer=self.serializer,
        )
    except TypeError as e:
        raise ValueError("Failed to instantiate queue class '%s': %s",
                         self.queue_cls, e)
    if self.flush_on_start:
        self.flush()
    # notice if there are requests already in the queue to resume the crawl
    if len(self.queue):
        spider.log("Resuming crawl (%d requests scheduled)" % len(self.queue))
# In a Redis sorted set, lower scores sort first (i.e. end up at the top),
# so to give higher-priority requests lower scores we store the negated priority.
score = -request.priority
# We don't use the zadd method as the order of arguments changes depending on
# whether the class is Redis or StrictRedis, and the option of using
# kwargs only accepts strings, not bytes.
# ZADD adds the member to the sorted set.
self.server.execute_command('ZADD', self.key, score, data)
def pop(self, timeout=0):
    """Pop a request from the sorted set.

    The timeout argument is not supported by this queue class: a sorted
    set has no blocking pop, so `timeout` exists only for interface
    compatibility and is ignored.
    """
    # We use multi/exec so that fetching the element and deleting it happen
    # atomically as a single operation: unlike a list, a sorted set has no
    # built-in pop, so we must read and remove in the same transaction.
    pipe = self.server.pipeline()
    pipe.multi()
    # zrange: return the members of the sorted set within the given rank
    #   range; (0, 0) selects the first (top) element.
    # zremrangebyrank: remove the members within the given rank range;
    #   (0, 0) again targets that first element.
    # See the Redis documentation for details.
    pipe.zrange(self.key, 0, 0).zremrangebyrank(self.key, 0, 0)
    results, count = pipe.execute()
    if results:
        return self._decode_request(results[0])
logger.info('This is a log info')
logger.debug('Debugging')
logger.warning('Warning exists')
logger.info('Finish')
Here we first import the logging module and apply some basic configuration via basicConfig, setting the level and format options. The level is set to INFO, so only messages at INFO level and above are emitted. The format string contains four fields: asctime, name, levelname, and message, which stand for the run time, module name, log level, and log message respectively, so each output line is the combination of these four. This is logging's global configuration.

Next we declare a Logger object, the main class for log output. Calling its info() method emits an INFO-level message, and calling debug() emits a DEBUG-level message, which is very convenient. When initializing it we pass in the module name, using __name__ directly: if the script is run directly this is __main__, while for an imported module it is that module's name. Since this variable differs from module to module, __name__ is the usual choice. We then emit four log messages: two INFO, one WARNING, and one DEBUG.
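The configuration code itself is not included in this excerpt; based on the description above, it would look roughly like this:

import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

With this configuration in place, running the script gives the following output: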
2018-06-03 13:42:43,526 - __main__ - INFO - This is a log info
2018-06-03 13:42:43,526 - __main__ - WARNING - Warning exists
2018-06-03 13:42:43,526 - __main__ - INFO - Finish
2018-06-03 14:53:36,467 - __main__ - INFO - This is a log info
2018-06-03 14:53:36,468 - __main__ - WARNING - Warning exists
2018-06-03 14:53:36,468 - __main__ - INFO - Finish
2018-06-03 15:13:44,895 - __main__ - INFO - This is a log info
2018-06-03 15:13:44,947 - __main__ - WARNING - Warning exists
2018-06-03 15:13:44,949 - __main__ - INFO - Finish
2018-06-03 16:55:56,259 - main - INFO - Main Info
2018-06-03 16:55:56,259 - main - ERROR - Main Error
2018-06-03 16:55:56,259 - main.core - INFO - Core Info
2018-06-03 16:55:56,259 - main.core - ERROR - Core Error
try:
    result = 5 / 0
except Exception as e:
    # bad
    logging.error('Error: %s', e)
    # good
    logging.error('Error', exc_info=True)
    # good
    logging.exception('Error')
memory: The memory to query; usually the output of an RNN encoder. This is the context information used during decoding; its shape must be [batch_size, max_time, context_dim]. With that in mind, let us look at the initializer of the parent class _BaseAttentionMechanism and its arguments:
probability_fn: A callable function which converts the score to probabilities. It must be callable; softmax() is used by default, and functions such as hardmax() can be specified instead.
score_mask_value: The mask value applied to scores before they are passed into probability_fn. The default is -inf. It is only used when memory_sequence_length is not None.
dtype: The data type for the query and memory layers of the attention mechanism; the default is float32. (A construction sketch using these parameters follows below.)
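As an illustration of how these parameters combine, here is a minimal sketch of constructing a BahdanauAttention mechanism in TensorFlow 1.x; the tensor shapes and values are assumptions for the example:

import tensorflow as tf

# Illustrative shapes: batch of 32, up to 50 encoder steps, 256-dim context.
encoder_outputs = tf.placeholder(tf.float32, [32, 50, 256])  # the `memory`
source_lengths = tf.placeholder(tf.int32, [32])              # true sequence lengths

attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
    num_units=128,                          # depth of the query/memory layers
    memory=encoder_outputs,
    memory_sequence_length=source_lengths,  # scores beyond each length get masked
    probability_fn=tf.nn.softmax,           # the default; hardmax is also possible
    dtype=tf.float32)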
def _bahdanau_score(processed_query, keys, normalize):
    dtype = processed_query.dtype
    # Get the number of hidden units from the trailing dimension of keys
    num_units = keys.shape[2].value or array_ops.shape(keys)[2]
    # Reshape from [batch_size, ...] to [batch_size, 1, ...] for broadcasting.
    processed_query = array_ops.expand_dims(processed_query, 1)
    v = variable_scope.get_variable(
        "attention_v", [num_units], dtype=dtype)
    if normalize:
        # Scalar used in weight normalization
        g = variable_scope.get_variable(
            "attention_g", dtype=dtype,
            initializer=math.sqrt((1. / num_units)))
        # Bias added prior to the nonlinearity
        b = variable_scope.get_variable(
            "attention_b", [num_units], dtype=dtype,
            initializer=init_ops.zeros_initializer())
        # normed_v = g * v / ||v||
        normed_v = g * v * math_ops.rsqrt(
            math_ops.reduce_sum(math_ops.square(v)))
        return math_ops.reduce_sum(
            normed_v * math_ops.tanh(keys + processed_query + b), [2])
    else:
        return math_ops.reduce_sum(
            v * math_ops.tanh(keys + processed_query), [2])
if attention_layer_size is not None:
    attention_layer_sizes = tuple(
        attention_layer_size
        if isinstance(attention_layer_size, (list, tuple))
        else (attention_layer_size,))
    if len(attention_layer_sizes) != len(attention_mechanisms):
        raise ValueError(
            "If provided, attention_layer_size must contain exactly one "
            "integer per attention_mechanism, saw: %d vs %d"
            % (len(attention_layer_sizes), len(attention_mechanisms)))
    self._attention_layers = tuple(
        layers_core.Dense(
            attention_layer_size, name="attention_layer", use_bias=False,
            dtype=attention_mechanisms[i].dtype)
        for i, attention_layer_size in enumerate(attention_layer_sizes))
    self._attention_layer_size = sum(attention_layer_sizes)
else:
    self._attention_layers = None
    self._attention_layer_size = sum(
        attention_mechanism.values.get_shape()[-1].value
        for attention_mechanism in attention_mechanisms)

for i, attention_mechanism in enumerate(self._attention_mechanisms):
    attention, alignments = _compute_attention(
        attention_mechanism, cell_output, previous_alignments[i],
        self._attention_layers[i] if self._attention_layers else None)
    alignment_history = previous_alignment_history[i].write(
        state.time, alignments) if self._alignment_history else ()
alignment_history: whether to store the alignments from previous steps in the state, so that they can be visualized later.
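For instance, a hedged sketch of enabling alignment_history on an AttentionWrapper (TF 1.x; shapes and sizes are illustrative):

import tensorflow as tf

encoder_outputs = tf.placeholder(tf.float32, [32, 50, 256])
attention_mechanism = tf.contrib.seq2seq.BahdanauAttention(
    num_units=128, memory=encoder_outputs)

cell = tf.nn.rnn_cell.LSTMCell(128)
attn_cell = tf.contrib.seq2seq.AttentionWrapper(
    cell,
    attention_mechanism,
    attention_layer_size=128,
    alignment_history=True)  # store each step's alignments in the state

# After decoding, final_state.alignment_history is a TensorArray;
# calling .stack() on it yields a tensor of alignments for visualization.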
Options:
  --update         Update Pipenv & pip to latest.
  --where          Output project home information.
  --venv           Output virtualenv information.
  --py             Output Python interpreter information.
  --envs           Output Environment Variable options.
  --rm             Remove the virtualenv.
  --bare           Minimal output.
  --completion     Output completion (to be eval'd).
  --man            Display manpage.
  --three / --two  Use Python 3/2 when creating virtualenv.
  --python TEXT    Specify which version of Python virtualenv should use.
  --site-packages  Enable site-packages for the virtualenv.
  --jumbotron      An easter egg, effectively.
  --version        Show the version and exit.
  -h, --help       Show this message and exit.

Usage Examples:
   Create a new project using Python 3.6, specifically:
   $ pipenv --python 3.6

   Install all dependencies for a project (including dev):
   $ pipenv install --dev

   Create a lockfile containing pre-releases:
   $ pipenv lock --pre

   Show a graph of your installed dependencies:
   $ pipenv graph

   Check your installed dependencies for security vulnerabilities:
   $ pipenv check

   Install a local setup.py into your virtual environment/Pipfile:
   $ pipenv install -e .

Commands:
  check      Checks for security vulnerabilities and against PEP 508 markers provided in Pipfile.
  graph      Displays currently-installed dependency graph information.
  install    Installs provided packages and adds them to Pipfile, or (if none is given), installs all packages.
  lock       Generates Pipfile.lock.
  open       View a given module in your editor.
  run        Spawns a command installed into the virtualenv.
  shell      Spawns a shell within the virtualenv.
  uninstall  Un-installs a provided package and removes it from Pipfile.
  update     Uninstalls all packages, and re-installs package(s) in [packages] to latest compatible versions.
Next, let's first verify that the current project has no virtualenv yet, by running the following command:
pipenv --venv
The result is as follows:
No virtualenv has been created for this project yet!
Warning: the environment variable LANG is not set!
We recommend setting this in ~/.profile (or equivalent) for proper expected behavior.
Creating a virtualenv for this project…
Using /usr/local/bin/python3 to create virtualenv…
⠋Running virtualenv with interpreter /usr/local/bin/python3
Using base prefix '/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6'
New python executable in /Users/CQC/.local/share/virtualenvs/PipenvTest-VSTVh89E/bin/python3.6
Also creating executable in /Users/CQC/.local/share/virtualenvs/PipenvTest-VSTVh89E/bin/python
Installing setuptools, pip, wheel...done.
Virtualenv location: /Users/CQC/.local/share/virtualenvs/PipenvTest-VSTVh89E
This is effectively the same as virtualenv's activation flow: something like source venv/bin/activate is invoked to prepend that path to the global PATH, so the python, python3, and python3.6 executables under it take precedence. You will notice the command prompt changes, gaining a (PipenvTest-VSTVh89E) prefix, which indicates that we have switched into the virtualenv. Now let's check the path of the Python executables with the which command (or where), as follows:
(PipenvTest-VSTVh89E) CQC-MAC% which python3
/Users/CQC/.local/share/virtualenvs/PipenvTest-VSTVh89E/bin/python3
(PipenvTest-VSTVh89E) CQC-MAC% which python3.6
/Users/CQC/.local/share/virtualenvs/PipenvTest-VSTVh89E/bin/python3.6
(PipenvTest-VSTVh89E) CQC-MAC% which python
/Users/CQC/.local/share/virtualenvs/PipenvTest-VSTVh89E/bin/python
Chinese word segmentation (Chinese Word Segmentation) splits a sequence of Chinese characters into individual words. On the surface it may look trivial, but segmentation quality has a large impact on information retrieval and on experimental results, and a variety of algorithms lie behind it. Chinese segmentation differs greatly from English tokenization: in English a word is a token by itself, whereas Chinese is written with characters as the basic unit and no explicit separators between words, so words must be split out deliberately. By their characteristics, segmentation algorithms fall into four broad categories:
Maximum matching (MM). The basic idea: let i be the number of characters in the longest entry of the segmentation dictionary, and take the first i characters of the current string as the match field. Look it up in the dictionary; if the dictionary contains such an i-character word, the match succeeds and the field is cut off as a word. If not, drop the last character of the match field and match again with the remainder, repeating until a match succeeds. Statistics put this method's error rate at about 1/169. (A minimal sketch follows after this list.)
Reverse maximum matching (RMM). The procedure is the same as MM, except that processing starts from the end of the sentence (or text), and on each failed match the first character of the field is dropped instead. Statistics put its error rate at about 1/245.
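To make the MM procedure concrete, here is a minimal sketch of forward maximum matching; the toy dictionary and sentence are illustrative:

def forward_max_match(sentence, dictionary, max_len=5):
    """Forward maximum matching: repeatedly take the longest dictionary
    word that prefixes the remaining text; fall back to one character."""
    words = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            chunk = sentence[i:i + size]
            if size == 1 or chunk in dictionary:
                words.append(chunk)
                i += size
                break
    return words

dictionary = {'研究', '研究生', '生命', '命', '的', '起源'}
print(forward_max_match('研究生命的起源', dictionary))
# ['研究生', '命', '的', '起源'] -- the classic failure case MM is known for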
SnowNLP: Simplified Chinese Text Processing, a library for conveniently handling Chinese text. It was inspired by TextBlob; since most NLP libraries target English, the author wrote a library oriented toward Chinese. Unlike TextBlob it does not use NLTK: all algorithms are implemented from scratch, and some trained dictionaries ship with it. GitHub: https://github.com/isnowfy/snownlp.
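A brief usage sketch (standard SnowNLP calls; the sample sentence is illustrative):

from snownlp import SnowNLP

s = SnowNLP(u'这个东西真心很赞')
print(s.words)       # segmented words
print(s.sentiments)  # sentiment score in [0, 1]; closer to 1 is more positive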
pip3 -V
pip 9.0.1 from /usr/local/anaconda3/lib/python3.6/site-packages (python 3.6)
which python3
/usr/local/anaconda3/bin/python3
python3
Python 3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19)
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
If an older CUDA version is already installed, you may want to uninstall it first to avoid conflicts with the new version. In the /usr/local/cuda/bin directory there is an uninstall_cuda_*.pl script which can be run directly to uninstall, with the following command:
sudo ./uninstall_cuda_*.pl
This removes CUDA completely. Next we download CUDA 9.0. Note that TensorFlow 1.5 and 1.6 are still only compatible with CUDA 9.0, not CUDA 9.1, so do not download 9.1. The CUDA 9.0 download page is https://developer.nvidia.com/cuda-90-download-archive; select the options matching your system in turn, as shown in the figure. Here we choose the Linux / x86_64 / Ubuntu / 16.04 / runfile configuration, then click the Download button in the Base Installer section to download the CUDA 9.0 installer. The corresponding download command is:
The NVIDIA CUDA Toolkit provides command-line and graphical tools for building, debugging and optimizing the performance

Do you accept the previously read EULA?
accept/decline/quit: accept
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.81? (y)es/(n)o/(q)uit: n
Install the CUDA 9.0 Toolkit? (y)es/(n)o/(q)uit: y
Enter Toolkit Location [ default is /usr/local/cuda-9.0 ]:
Do you want to install a symbolic link at /usr/local/cuda? (y)es/(n)o/(q)uit: y
Install the CUDA 9.0 Samples? (y)es/(n)o/(q)uit: y
Enter CUDA Samples Location [ default is /home/cqc ]:
Installing the CUDA Toolkit in /usr/local/cuda-9.0 ...
Finally, if you see a message like the following, CUDA has been installed successfully:
Driver:   Not Selected
Toolkit:  Installed in /usr/local/cuda-9.0
Samples:  Installed in /home/cqc, but missing recommended libraries
Please make sure that
 -   PATH includes /usr/local/cuda-9.0/bin
 -   LD_LIBRARY_PATH includes /usr/local/cuda-9.0/lib64, or, add /usr/local/cuda-9.0/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-9.0/bin
Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-9.0/doc/pdf for detailed information on setting up CUDA.
***WARNING: Incomplete installation! This installation did not install the CUDA Driver.
A driver of version at least 384.00 is required for CUDA 9.0 functionality to work.
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
    sudo <CudaInstaller>.run -silent -driver
Linux version 4.4.0-112-generic (buildd@lgw01-amd64-010) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.5) ) #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018
import re

text = '我的个人邮箱是cqc@cuiqingcai.com,个人博客是cuiqingcai.com,个人公众号是进击的Coder'
results = re.findall('个人(.*?)是(.*?)(?=,|\Z)', text)
for result in results:
    print(result[0] + ': ' + result[1])
data = tf.constant([1, 2, 3])
x = tf.layers.Input(tensor=data)
print(x)
The result is as follows:
Tensor("Const:0", shape=(3,), dtype=int32)
As you can see, the shape and dtype are computed automatically.
batch_normalization
This method performs batch normalization, which can speed up training. It is defined in tensorflow/python/layers/normalization.py; see the paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", http://arxiv.org/abs/1502.03167.
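A rough usage sketch (TF 1.x; the input shape and training flag are illustrative). Note that the layer's moving-average updates live in the UPDATE_OPS collection and must be run together with the training op:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 32])   # illustrative input
is_training = tf.placeholder(tf.bool, [])

y = tf.layers.batch_normalization(x, training=is_training)

# The moving mean/variance are updated via ops in UPDATE_OPS:
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = tf.no_op(name='train')        # stand-in for the real optimizer step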
def text2vec(text):
    """
    text to one-hot vector
    :param text: source text
    :return: np array
    """
    if len(text) > CAPTCHA_LENGTH:
        return False
    vector = np.zeros(CAPTCHA_LENGTH * VOCAB_LENGTH)
    for i, c in enumerate(text):
        index = i * VOCAB_LENGTH + VOCAB.index(c)
        vector[index] = 1
    return vector
def vec2text(vector):
    """
    vector to captcha text
    :param vector: np array
    :return: text
    """
    if not isinstance(vector, np.ndarray):
        vector = np.asarray(vector)
    vector = np.reshape(vector, [CAPTCHA_LENGTH, -1])
    text = ''
    for item in vector:
        text += VOCAB[np.argmax(item)]
    return text
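As a quick sanity check, the two helpers should round-trip (assuming, say, CAPTCHA_LENGTH == 4 and the characters below are present in VOCAB):

vector = text2vec('abcd')  # one-hot encoding of the text
print(vec2text(vector))    # should print 'abcd' again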
# generate data x and y
for i in range(DATA_LENGTH):
    text = get_random_text()
    # get captcha array
    captcha_array = generate_captcha(text)
    # get vector
    vector = text2vec(text)
    data_x.append(captcha_array)
    data_y.append(vector)
# write data to pickle
if not exists(DATA_PATH):
    makedirs(DATA_PATH)
x = np.asarray(data_x, np.float32)
y = np.asarray(data_y, np.float32)
with open(join(DATA_PATH, 'data.pkl'), 'wb') as f:
    pickle.dump(x, f)
    pickle.dump(y, f)
# input layer
with tf.variable_scope('inputs'):
    # x.shape = [-1, 60, 160, 3]
    x, y_label = iterator.get_next()
    keep_prob = tf.placeholder(tf.float32, [])
    y = tf.cast(x, tf.float32)
# 3 CNN layers
for _ in range(3):
    y = tf.layers.conv2d(y, filters=32, kernel_size=3, padding='same',
                         activation=tf.nn.relu)
    y = tf.layers.max_pooling2d(y, pool_size=2, strides=2, padding='same')
    # y = tf.layers.dropout(y, rate=keep_prob)
# 2 dense layers
y = tf.layers.flatten(y)
y = tf.layers.dense(y, 1024, activation=tf.nn.relu)
y = tf.layers.dropout(y, rate=keep_prob)
y = tf.layers.dense(y, VOCAB_LENGTH)
NOTE: While the classes are mutually exclusive, their probabilities need not be. All that is required is that each row of labels is a valid probability distribution. If they are not, the computation of the gradient will be incorrect.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from urllib.parse import quote
upstream splash {
    server 41.159.27.223:8050 weight=4;
    server 41.159.27.221:8050 weight=2;
    server 41.159.27.9:8050 weight=2;
    server 41.159.117.119:8050 weight=1;
}
import requests
from urllib.parse import quote
import re
lua = '''
function main(splash, args)
  local treat = require("treat")
  local response = splash:http_get("http://httpbin.org/get")
  return treat.as_string(response.body)
end
'''
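The script above would then be submitted to Splash's execute endpoint; a minimal sketch, assuming Splash is listening on localhost:8050:

url = 'http://localhost:8050/execute?lua_source=' + quote(lua)
response = requests.get(url)
print(response.text)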
The HAR result shows that Splash performed the entire page-rendering process, including loading CSS and JavaScript, and the rendered page is exactly what we would get in a browser.
So what controls this process? Returning to the home page, we can see there is in fact a script, with the following content:
function main(splash, args)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
This script is written in Lua. Even without knowing the language's syntax, we can roughly tell from the surface of the script that it first calls the go() method to load the page, then calls wait() to wait for a certain time, and finally returns the page's source, a screenshot, and the HAR information.
function main(splash, args)
  local example_urls = {"www.baidu.com", "www.taobao.com", "www.zhihu.com"}
  local urls = args.urls or example_urls
  local results = {}
  for index, url in ipairs(urls) do
    local ok, reason = splash:go("http://" .. url)
    if ok then
      splash:wait(2)
      results[url] = splash:png()
    end
  end
  return results
end
This property controls whether images are loaded; by default they are. Disabling it saves network traffic and speeds up page loading. Note, however, that disabling image loading may affect JavaScript rendering: with images disabled, the heights of their enclosing DOM nodes change, which in turn shifts the positions of other DOM nodes. So if the JavaScript manipulates image nodes, its execution may be affected.
function main(splash, args)
  local ok, reason = splash:go{
    "http://httpbin.org/post",
    http_method="POST",
    body="name=Germey"}
  if ok then
    return splash:html()
  end
end
function main(splash, args)
  local get_div_count = splash:jsfunc([[
    function () {
      var body = document.body;
      var divs = body.getElementsByTagName('div');
      return divs.length;
    }
  ]])
  splash:go("https://www.baidu.com")
  return ("There are %s DIVs"):format(get_div_count())
end
function main(splash, args)
  assert(splash:autoload("https://code.jquery.com/jquery-2.1.3.min.js"))
  assert(splash:go("https://www.taobao.com"))
  local version = splash:evaljs("$.fn.jquery")
  return 'JQuery version: ' .. version
end
function main(splash)
  local treat = require('treat')
  assert(splash:go("http://quotes.toscrape.com/"))
  assert(splash:wait(0.5))
  local texts = splash:select_all('.quote .text')
  local results = {}
  for index, text in ipairs(texts) do
    results[index] = text.node.innerHTML
  end
  return treat.as_array(results)
end
Here we select the quote text nodes with a CSS selector, then iterate over all of them and extract the text of each.
The result is as follows:
SplashResponse: Array[10]
0: "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”"
1: "“It is our choices, Harry, that show what we truly are, far more than our abilities.”"
2: "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”"
3: "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”"
4: "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"
5: "“Try not to become a man of success. Rather become a man of value.”"
6: "“It is better to be hated for what you are than to be loved for what you are not.”"
7: "“I have not failed. I've just found 10,000 ways that won't work.”"
8: "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”"
9: "“A day without sunshine is like, you know, night.”"
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
def get_images(json):
    if json.get('data'):
        for item in json.get('data'):
            title = item.get('title')
            images = item.get('image_detail')
            for image in images:
                yield {
                    'image': image.get('url'),
                    'title': title
                }
def save_image(item):
    if not os.path.exists(item.get('title')):
        os.mkdir(item.get('title'))
    try:
        response = requests.get(item.get('image'))
        if response.status_code == 200:
            file_path = '{0}/{1}.{2}'.format(
                item.get('title'), md5(response.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to Save Image')
Finally, we only need to construct an array of offset values, traverse it, extract the image links, and download them:
from multiprocessing.pool import Pool
def main(offset):
    json = get_page(offset)
    for item in get_images(json):
        print(item)
        save_image(item)
GROUP_START = 1
GROUP_END = 20
if __name__ == '__main__':
    pool = Pool()
    groups = ([x * 20 for x in range(GROUP_START, GROUP_END + 1)])
    pool.map(main, groups)
    pool.close()
    pool.join()
db = pymysql.connect(host='localhost', user='root',
                     password='123456', port=3306, db='spiders')
cursor = db.cursor()
sql = 'CREATE TABLE IF NOT EXISTS students (id VARCHAR(255) NOT NULL, name VARCHAR(255) NOT NULL, age INT NOT NULL, PRIMARY KEY (id))'
cursor.execute(sql)
db.close()