tqdm: A Handy Progress Bar Tool for Python

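Introduction

Advantages:

Easy to use: wrap the iterator of a Python loop in tqdm, and a single line of code gives you a polished progress bar.
Flexible: it works with for loops, the apply function on pandas DataFrames, Python's map function, and more.
Efficient: tqdm skips unnecessary display refreshes, so its overhead stays low and it does not noticeably slow your code even when iterations are very fast.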
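
As a minimal sketch of the points above, the snippet below wraps a plain for loop and a map call. The square function and the range sizes are made up for illustration. Note that map has no length, so total is passed explicitly to get a percentage.

```python
from tqdm import tqdm


def square(x):
    """Hypothetical unit of work."""
    return x * x


# Wrapping any iterable in tqdm() is enough to get a progress bar.
total = 0
for value in tqdm(range(1000), desc="for loop"):
    total += square(value)

# map() is lazy and has no length, so total= tells tqdm how many items to expect.
squares = list(tqdm(map(square, range(1000)), total=1000, desc="map"))
```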

Use cases

Downloading large files with requests

python
def DownLoadBigFile():
    import requests
    from tqdm import tqdm
    # A file of about 5 GB
    url = "https://mqdb-release-1253802058.cos.ap-beijing.myqcloud.com/datasets/openai-50w-cosine.hdf5"
    # stream=True fetches the response body incrementally instead of loading it all at once.

    response = requests.get(url, stream=True)
    filesize = int(response.headers['Content-Length'])
    chunk_size = 1024
    # Number of 1 KB chunks, rounded up so the final partial chunk is counted.
    num_bars = (filesize + chunk_size - 1) // chunk_size
    with open('test.tgz', 'wb') as fp:
        for chunk in tqdm(response.iter_content(chunk_size=chunk_size), total=num_bars, unit="KB", desc='test.tgz', leave=True):
            fp.write(chunk)

When run, this shows a progress bar labeled test.tgz that advances as each chunk is written.
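
The loop above counts chunks and labels them as KB. A variant some readers may prefer is to track bytes directly and let tqdm scale the units. The sketch below is one way to do that; the helper names write_chunks and download_with_progress are hypothetical and are not part of requests or tqdm. It does not require a Content-Length header, but without one the bar shows a counter with no percentage.

```python
from tqdm import tqdm


def write_chunks(chunks, fp, total_bytes=None, desc=None):
    """Write byte chunks to fp, advancing the bar by each chunk's real size."""
    written = 0
    with tqdm(total=total_bytes, unit="B", unit_scale=True,
              unit_divisor=1024, desc=desc) as pbar:
        for chunk in chunks:
            fp.write(chunk)
            written += len(chunk)
            pbar.update(len(chunk))
    return written


def download_with_progress(url, dest, chunk_size=8192):
    """Stream url into dest (hypothetical wrapper around write_chunks)."""
    import requests  # imported here so write_chunks works without requests

    # The with block releases the connection even if an error occurs.
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        # Content-Length may be absent; tqdm then shows a counter only.
        length = response.headers.get("Content-Length")
        total = int(length) if length is not None else None
        with open(dest, "wb") as fp:
            return write_chunks(response.iter_content(chunk_size=chunk_size),
                                fp, total_bytes=total, desc=dest)
```

Because write_chunks only needs an iterable of bytes, it can be tried out with in-memory data before pointing download_with_progress at a real URL.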

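Machine learning and preprocessing large datasets

When training deep learning models, we often iterate over many epochs. With tqdm, the progress of training is clearly visible.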

python
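from tqdm import tqdm
# Assume we have a training dataset train_dataloader and a model named model.
# The two values below are hypothetical stand-ins so the snippet runs on its own.
num_epochs = 3
train_dataloader = range(100)
for epoch in range(num_epochs):
    epoch_iterator = tqdm(train_dataloader, desc="Training (Epoch %d)" % epoch)

    for step, batch in enumerate(epoch_iterator):
        # Model training code goes here
        # ...
        pass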
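
It is often useful to show a running metric next to the bar. The sketch below uses tqdm's set_postfix for that. The data loader, the fake_train_step function, and its loss values are stand-ins invented for illustration, not a real model.

```python
from tqdm import tqdm

num_epochs = 2                      # hypothetical number of epochs
train_dataloader = list(range(50))  # stand-in for a real data loader


def fake_train_step(step):
    """Placeholder for a real training step; returns a made-up loss."""
    return 1.0 / (step + 1)


last_losses = []
for epoch in range(num_epochs):
    epoch_iterator = tqdm(train_dataloader, desc="Training (Epoch %d)" % epoch)
    for step, batch in enumerate(epoch_iterator):
        loss = fake_train_step(step)
        # Show the latest loss to the right of the bar.
        epoch_iterator.set_postfix(loss="%.4f" % loss)
    last_losses.append(loss)
```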

Nested loops

Code often contains nested loops. In that case, tqdm can create one progress bar per loop level.

python
from tqdm import tqdm
import time
for i in tqdm(range(100), desc="Outer loop"):
    for j in tqdm(range(10), desc="Inner loop", leave=False):
        # Do some time-consuming work
        time.sleep(0.01)
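
Calling print from inside nested loops can break up the bars. tqdm provides tqdm.write for this case, and the sketch below uses it to report after each outer iteration. The loop sizes and the message text are arbitrary.

```python
import time
from tqdm import tqdm

completed = 0
for i in tqdm(range(3), desc="Outer loop"):
    for j in tqdm(range(5), desc="Inner loop", leave=False):
        time.sleep(0.01)  # placeholder for real work
        completed += 1
    # tqdm.write prints a line without corrupting the active bars.
    tqdm.write("finished outer iteration %d" % i)
```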

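Processing data in batches

Sometimes the progress bar needs to be updated manually. For example, when downloading a file or processing data in batches, several items may be handled at once. In that case, use the update method.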

python
from tqdm import tqdm
import time
with tqdm(total=100) as pbar:
    for i in range(10):
        # Do some time-consuming work
        time.sleep(0.1)
        pbar.update(10)
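
When batch sizes are uneven, the bar can be advanced by the actual number of items in each batch. The sketch below assumes a hypothetical list of 95 items processed in batches of up to 10, so the last batch holds only 5.

```python
from tqdm import tqdm

items = list(range(95))  # hypothetical work items
batch_size = 10

processed = 0
with tqdm(total=len(items), unit="item") as pbar:
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        # ... process the batch here ...
        processed += len(batch)
        # Advance by the real batch size so the bar ends exactly at the total.
        pbar.update(len(batch))
```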

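Aside: what does the stream parameter in requests do?

In the requests library, sending a request with stream=True means you want to receive the response content as a stream rather than fetching the entire response body at once.

Specifically, requests then reads the data block by block as you consume the response. This is especially useful for large responses, because you can process a large file incrementally while downloading it instead of loading the whole body into memory.

For example, if you obtain the response like this: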

python
import requests

url = 'http://example.com/large-file.zip'
response = requests.get(url, stream=True)

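then the response object lets you access the content block by block through an iterator instead of loading it into memory in one go. You can read these blocks with the iter_content() method, for example: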

python
with open('large-file.zip', 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):
        if chunk:
            f.write(chunk)

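In the example above, iter_content() lets you write the file block by block, which handles large downloads while keeping memory use low.

In short, stream=True controls how requests transfers the response body, making large responses more flexible and efficient to handle.

Differences between using and not using the stream parameter:

Without stream:

The GET request downloads the entire body right away. For a 1 GB video, the whole 1 GB is loaded into memory before any further processing.

With stream=True:

The GET request establishes the connection and reads only the response headers. The body is not loaded into memory as content or text up front; it is downloaded only once you start reading it, for example through content or iter_content().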

Extension: what if a download fails because of network or storage problems? Resumable downloads

python
def DownLoadBigFile():
    import requests
    from tqdm import tqdm
    # A file of about 5 GB (not used in this test)
    # url = "https://mqdb-release-1253802058.cos.ap-beijing.myqcloud.com/datasets/openai-50w-cosine.hdf5"
    # A 185 KB image, used here to test resuming
    url = 'https://1.bp.blogspot.com/-1jq9R-na21U/Xwhip410sFI/AAAAAAAAAAY/fLO-FzWFlU8eKiCA5-IWAR22YoC1mM8-QCLcBGAsYHQ/s2048/real-ai.jpg'

    # stream=True fetches the response body incrementally instead of loading it all at once.
    # First run: request bytes 0 through 100*1024 and open the file in 'wb' mode.
    # headers = {'Range': 'bytes=0-%d'%(100*1024)}
    # Second run: request everything after that and open the file in 'ab' mode.
    headers = {'Range': 'bytes=%d-'%(100*1024+1)}
    response = requests.get(url, stream=True, headers=headers)
    # For a Range request, Content-Length is the size of the returned part only.
    filesize = int(response.headers['Content-Length'])
    chunk_size = 1024
    num_bars = (filesize + chunk_size - 1) // chunk_size
    with open('test.tgz', 'ab') as fp:
        for chunk in tqdm(response.iter_content(chunk_size=chunk_size), total=num_bars, unit="KB", desc='test.tgz', leave=True):
            fp.write(chunk)
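
Rather than hard-coding the offset, the resume position can be derived from the size of the partial file already on disk. The following is a sketch under stated assumptions: the server honors Range requests and answers with 206 Partial Content. The names range_header and resume_download are hypothetical helpers. If the server ignores the header and returns 200, the sketch restarts from scratch.

```python
import os
from tqdm import tqdm


def range_header(existing_bytes):
    """Build a Range header asking for everything after the bytes already saved."""
    if existing_bytes <= 0:
        return {}
    # Byte offsets are zero-based, so the next missing byte is at index existing_bytes.
    return {"Range": "bytes=%d-" % existing_bytes}


def resume_download(url, dest, chunk_size=1024):
    """Resume url into dest, appending only if the server returns 206."""
    import requests  # imported lazily so range_header is usable on its own

    existing = os.path.getsize(dest) if os.path.exists(dest) else 0
    with requests.get(url, stream=True, headers=range_header(existing)) as response:
        response.raise_for_status()
        resumed = response.status_code == 206
        if not resumed:
            existing = 0  # server sent the whole file again, so overwrite
        length = response.headers.get("Content-Length")
        total = existing + int(length) if length is not None else None
        mode = "ab" if resumed else "wb"
        with open(dest, mode) as fp, tqdm(
            total=total, initial=existing, unit="B", unit_scale=True, desc=dest
        ) as pbar:
            for chunk in response.iter_content(chunk_size=chunk_size):
                fp.write(chunk)
                pbar.update(len(chunk))
```

Passing initial=existing makes the bar start where the previous attempt stopped. One caveat: if the file is already complete, a server may reject the range with a 416 status, which raise_for_status turns into an exception.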

Complete code

python
import time

from tqdm import tqdm

def SingleLoop():
    for i in tqdm(range(100)):
        # Pretend we are doing something slow, such as training a deep learning model
        time.sleep(0.01)

def DownLoadBigFile():
    import requests
    # A file of about 5 GB
    url = "https://mqdb-release-1253802058.cos.ap-beijing.myqcloud.com/datasets/openai-50w-cosine.hdf5"
    # stream=True fetches the response body incrementally instead of loading it all at once.

    response = requests.get(url, stream=True)
    filesize = int(response.headers['Content-Length'])
    chunk_size = 1024
    # Number of 1 KB chunks, rounded up so the final partial chunk is counted.
    num_bars = (filesize + chunk_size - 1) // chunk_size
    with open('test.tgz', 'wb') as fp:
        for chunk in tqdm(response.iter_content(chunk_size=chunk_size), total=num_bars, unit="KB", desc='test.tgz', leave=True):
            fp.write(chunk)

# def PreDealBigData():
#     import pandas as pd
#     from tqdm import tqdm
#     tqdm.pandas()
#     # Assume we have a large dataframe and need to preprocess its 'text' column
#     df['processed_text'] = df['text'].progress_apply(lambda x: preprocess(x))

# def MachineLearn():
#     # Assume we have a training dataset train_dataloader and a model named model
#     for epoch in range(num_epochs):
#         epoch_iterator = tqdm(train_dataloader, desc="Training (Epoch %d)" % epoch)
#         for step, batch in enumerate(epoch_iterator):
#             # Model training code goes here
#             pass

def NestedLoop():
    for i in tqdm(range(100),desc="outer loop"):
        # leave=False removes the inner bar once it finishes, so completed inner bars do not clutter the output
        for j in tqdm(range(10),desc="Inner Loop",leave=False):
            # Do some time-consuming work
            time.sleep(0.01)
            
def HandUpdateBar():
    with tqdm(total=100) as bar:
        for i in range(10):
            # Do some time-consuming work
            time.sleep(0.1)
            # Each iteration handles 10 items, so advance the bar by 10
            bar.update(10)


if __name__ == '__main__':
    HandUpdateBar()
