How to Scrape X.com (Formerly Twitter) Data with Python (2025 Update)

wanhuiba3269 · published 2025-08-26 20:56:24

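I. Introduction

Since Twitter.com was rebranded as X.com, free access to its public API has been largely withdrawn, but web scraping offers a workaround. In this X.com web scraping tutorial, we'll look at how to scrape X.com posts and profiles using Python and Playwright.

We'll use Python to retrieve the following X.com data:

X.com post (tweet) information
X.com user profile information

Unfortunately, other data points cannot be scraped without logging in, though we'll mention some possible workarounds and suggestions.

This tutorial requires no login and no elaborate tricks: we'll build a simple yet capable scraper using only a headless browser and captured background requests. For the headless browser environment we'll use the Scrapfly SDK with JavaScript rendering enabled; for readers who don't use Scrapfly, we'll also show how to achieve similar results with Playwright.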

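II. Important Notes

Before scraping, follow these rules to stay out of trouble:

Do not scrape at a rate that could harm the website.
Do not store personally identifiable information (PII) of EU citizens protected by the General Data Protection Regulation (GDPR).
Do not repurpose complete public datasets, which may be illegal in some countries.

Scrapfly does not offer legal advice, but these are good general rules for web scraping. For more detailed legal guidance, consult a lawyer.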

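III. Why Scrape X.com?

X.com (formerly Twitter.com) is a major publishing platform where individuals and companies alike announce news. Scraping it helps us follow industry trends; for example, posts about the stock market or cryptocurrencies can be collected as input for predicting future stock or cryptocurrency prices.

X.com is also a rich data source for sentiment analysis. Its data shows how people feel about a particular topic or brand, which matters for market research, product development, and brand awareness.

So, if we can scrape X.com with Python, we get this valuable public information for free!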

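IV. Project Setup

This tutorial covers scraping X (formerly Twitter) with Python using either scrapfly-sdk or Playwright. To parse the scraped X.com datasets we'll use the jmespath JSON query library, which can both extract values from JSON and reshape it.

All of these libraries are free and can be installed from the terminal with pip install. Playwright additionally needs a browser binary, which its own install command downloads:

$ pip install playwright jmespath scrapfly-sdk
$ playwright install chromium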

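V. How X.com Pages Work

Before we start scraping, let's do a little reverse engineering to understand how the X.com website works, since this shapes how we build the scraper.

First, X.com is a JavaScript web application that relies on a large number of background requests (XHR) to display page data. In short, it loads the initial HTML, starts the JS app, and then loads tweet data through XHR requests:

HTML → loads → JS app → loads → XHR → updates → page

These XHR requests are what we'll capture!

Without a headless browser such as Playwright or the Scrapfly SDK, scraping becomes very difficult, because we would have to reverse engineer the entire X.com API and application flow. On top of that, X.com's HTML is dynamic and complex, which makes the scraped markup hard to parse. The best way to scrape X.com (formerly Twitter) is therefore to use a headless browser and capture the background requests that fetch tweet and user data.

In summary, the core approach is:

Start a headless web browser.
Enable background request capture.
Load the X.com page.
Select the captured background requests that contain post or profile data.

For example, if we open an X.com (formerly Twitter) profile page with the browser developer tools open, we can see the background requests the page makes to load its data.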
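The last step of that approach, picking the relevant requests out of everything the browser captured, can be sketched in plain Python. This is a minimal illustration rather than part of the tutorial's scrapers: the helper name `find_xhr_payloads`, the sample URLs, and the `abc` path segment are hypothetical, and the records are assumed to be dictionaries with a `url` and a `response` holding a JSON `body`.

```python
import json
from typing import Any, Dict, List


def find_xhr_payloads(xhr_calls: List[Dict[str, Any]], marker: str) -> List[Dict[str, Any]]:
    """Return the decoded JSON bodies of captured calls whose URL contains `marker`."""
    payloads = []
    for call in xhr_calls:
        if marker not in call.get("url", ""):
            continue  # unrelated background request
        response = call.get("response")
        if not response or not response.get("body"):
            continue  # the request was captured but has no usable response
        try:
            payloads.append(json.loads(response["body"]))
        except json.JSONDecodeError:
            continue  # body is not JSON (for example an error page)
    return payloads


# Hypothetical captured calls used only to exercise the helper
captured = [
    {"url": "https://x.com/i/api/graphql/abc/TweetResultByRestId?variables=1",
     "response": {"body": '{"data": {"tweetResult": {"result": {"rest_id": "1"}}}}'}},
    {"url": "https://x.com/i/api/1.1/other.json", "response": {"body": "{}"}},
    {"url": "https://x.com/i/api/graphql/abc/TweetResultByRestId?variables=2",
     "response": None},
]
tweet_payloads = find_xhr_payloads(captured, "TweetResultByRestId")
print(len(tweet_payloads))
```

Skipping calls without a response and bodies that fail to decode keeps one bad capture from aborting the whole scrape.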

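VI. Scraping X.com Posts (Tweets)

To scrape a single X.com post page, we load it in a headless browser and capture the background request that fetches the tweet details. This request can be recognized by TweetResultByRestId in its URL, and its JSON response contains both the post and its author.

Below we implement this in Python in two ways: with Playwright and with the Scrapfly SDK.

(1) Scraping with Playwright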

from playwright.sync_api import sync_playwright
from typing import Dict
import jmespath

def scrape_tweet(url: str) -> dict:
    """
    Scrape a single tweet page, for example:
    https://twitter.com/Scrapfly_dev/status/1667013143904567296
    Returns the parsed main tweet, or None if no tweet request was captured.
    """
    _xhr_calls = []

    def intercept_response(response):
        """Capture every background (XHR) response and store it."""
        if response.request.resource_type == "xhr":
            _xhr_calls.append(response)

    with sync_playwright() as pw:
        # headless=False opens a visible browser window; pass headless=True to run without one
        browser = pw.chromium.launch(headless=False)
        context = browser.new_context(viewport={"width": 1920, "height": 1080})
        page = context.new_page()

        # enable background request interception
        page.on("response", intercept_response)
        # go to the URL and wait for the tweet to render
        page.goto(url)
        page.wait_for_selector("[data-testid='tweet']")

        # keep only the background requests that carry tweet data
        tweet_calls = [f for f in _xhr_calls if "TweetResultByRestId" in f.url]
        for xhr in tweet_calls:
            data = xhr.json()
            # parse the tweet dataset and return it
            return parse_tweet(data['data']['tweetResult']['result'])

def parse_tweet(data: Dict) -> Dict:
    """Parse the raw tweet JSON dataset and keep the most important fields."""
    result = jmespath.search(
        """{
        created_at: legacy.created_at,
        attached_urls: legacy.entities.urls[].expanded_url,
        attached_urls2: legacy.entities.url.urls[].expanded_url,
        attached_media: legacy.entities.media[].media_url_https,
        tagged_users: legacy.entities.user_mentions[].screen_name,
        tagged_hashtags: legacy.entities.hashtags[].text,
        favorite_count: legacy.favorite_count,
        bookmark_count: legacy.bookmark_count,
        quote_count: legacy.quote_count,
        reply_count: legacy.reply_count,
        retweet_count: legacy.retweet_count,
        text: legacy.full_text,
        is_quote: legacy.is_quote_status,
        is_retweet: legacy.retweeted,
        language: legacy.lang,
        user_id: legacy.user_id_str,
        id: legacy.id_str,
        conversation_id: legacy.conversation_id_str,
        source: source,
        views: views.count
    }""",
        data,
    )
    result["poll"] = {}
    poll_data = jmespath.search("card.legacy.binding_values", data) or []
    for poll_entry in poll_data:
        key, value = poll_entry["key"], poll_entry["value"]
        if "choice" in key:
            result["poll"][key] = value["string_value"]
        elif "end_datetime" in key:
            result["poll"]["end"] = value["string_value"]
        elif "last_updated_datetime" in key:
            result["poll"]["updated"] = value["string_value"]
        elif "counts_are_final" in key:
            result["poll"]["ended"] = value["boolean_value"]
        elif "duration_minutes" in key:
            result["poll"]["duration"] = value["string_value"]
    user_data = jmespath.search("core.user_results.result", data)
    if user_data:
        # a parse_user function could be added here to flatten the user data
        result["user"] = user_data
    return result

if __name__ == "__main__":
    print(scrape_tweet("https://twitter.com/Scrapfly_dev/status/1664267318053179398"))

(2) Scraping with the Scrapfly SDK

import asyncio
import json
from typing import Dict
import jmespath
from scrapfly import ScrapeConfig, ScrapflyClient

# initialize the Scrapfly client; replace with your own Scrapfly key
SCRAPFLY = ScrapflyClient(key="YOUR SCRAPFLY KEY")

async def scrape_tweet(url: str) -> dict:
    """
    Scrape a single tweet page, for example:
    https://twitter.com/Scrapfly_dev/status/1667013143904567296
    Returns the parsed main tweet, or None if no tweet request was captured.
    """
    result = await SCRAPFLY.async_scrape(ScrapeConfig(
        url,
        render_js=True,  # enable the headless browser
        wait_for_selector="[data-testid='tweet']"  # wait until the tweet element appears
    ))
    # take the captured background requests and keep those that carry tweet data
    _xhr_calls = result.scrape_result["browser_data"]["xhr_call"]
    tweet_call = [f for f in _xhr_calls if "TweetResultByRestId" in f["url"]]
    for xhr in tweet_call:
        if not xhr["response"]:
            continue
        data = json.loads(xhr["response"]["body"])
        # parse the tweet dataset and return it
        return parse_tweet(data['data']['tweetResult']['result'])

def parse_tweet(data: Dict) -> Dict:
    """Parse the raw tweet JSON dataset and keep the most important fields."""
    result = jmespath.search(
        """{
        created_at: legacy.created_at,
        attached_urls: legacy.entities.urls[].expanded_url,
        attached_urls2: legacy.entities.url.urls[].expanded_url,
        attached_media: legacy.entities.media[].media_url_https,
        tagged_users: legacy.entities.user_mentions[].screen_name,
        tagged_hashtags: legacy.entities.hashtags[].text,
        favorite_count: legacy.favorite_count,
        bookmark_count: legacy.bookmark_count,
        quote_count: legacy.quote_count,
        reply_count: legacy.reply_count,
        retweet_count: legacy.retweet_count,
        text: legacy.full_text,
        is_quote: legacy.is_quote_status,
        is_retweet: legacy.retweeted,
        language: legacy.lang,
        user_id: legacy.user_id_str,
        id: legacy.id_str,
        conversation_id: legacy.conversation_id_str,
        source: source,
        views: views.count
    }""",
        data,
    )
    result["poll"] = {}
    poll_data = jmespath.search("card.legacy.binding_values", data) or []
    for poll_entry in poll_data:
        key, value = poll_entry["key"], poll_entry["value"]
        if "choice" in key:
            result["poll"][key] = value["string_value"]
        elif "end_datetime" in key:
            result["poll"]["end"] = value["string_value"]
        elif "last_updated_datetime" in key:
            result["poll"]["updated"] = value["string_value"]
        elif "counts_are_final" in key:
            result["poll"]["ended"] = value["boolean_value"]
        elif "duration_minutes" in key:
            result["poll"]["duration"] = value["string_value"]
    user_data = jmespath.search("core.user_results.result", data)
    if user_data:
        # a parse_user function could be added here to flatten the user data
        result["user"] = user_data
    return result

if __name__ == "__main__":
    print(asyncio.run(scrape_tweet("https://twitter.com/Scrapfly_dev/status/1664267318053179398")))

(3) Example Output

{
  "tweet": {
    "__typename": "Tweet",
    "rest_id": "1664267318053179398",
    "core": {
      "user_results": {
        "result": {
          "__typename": "User",
          "id": "VXNlcjoxMzEwNjIzMDgxMzAwNDAyMTc4",
          "rest_id": "1310623081300402178",
          "affiliates_highlighted_label": {},
          "is_blue_verified": true,
          "profile_image_shape": "Circle",
          "legacy": {
            "created_at": "Mon Sep 28 16:51:22 +0000 2020",
            "default_profile": true,
            "default_profile_image": false,
            "description": "Web Scraping API - turn any website into a database!\n\nScrapFly allows you to quickly achieve your data goals without web scraping challenges and errors.",
            "entities": {
              "description": {
                "urls": []
              },
              "url": {
                "urls": [
                  {
                    "display_url": "scrapfly.io",
                    "expanded_url": "https://scrapfly.io",
                    "url": "https://t.co/1Is3k6KzyM",
                    "indices": [0, 23]
                  }
                ]
              }
            },
            "fast_followers_count": 0,
            "favourites_count": 26,
            "followers_count": 163,
            "friends_count": 993,
            "has_custom_timelines": true,
            "is_translator": false,
            "listed_count": 2,
            "location": "Paris",
            "media_count": 11,
            "name": "Scrapfly",
            "normal_followers_count": 163,
            "pinned_tweet_ids_str": [],
            "possibly_sensitive": false,
            "profile_banner_url": "https://pbs.twimg.com/profile_banners/1310623081300402178/1601320645",
            "profile_image_url_https": "https://pbs.twimg.com/profile_images/1310658795715076098/XedZDwC7_normal.jpg",
            "profile_interstitial_type": "",
            "screen_name": "Scrapfly_dev",
            "statuses_count": 56,
            "translator_type": "none",
            "url": "https://t.co/1Is3k6KzyM",
            "verified": false,
            "withheld_in_countries": []
          }
        }
      }
    },
    "edit_control": {
      "edit_tweet_ids": ["1664267318053179398"],
      "editable_until_msecs": "1685629023000",
      "is_edit_eligible": true,
      "edits_remaining": "5"
    },
    "is_translatable": false,
    "views": {
      "count": "43",
      "state": "EnabledWithCount"
    },
    "source": "<a href=\"https://zapier.com/\" rel=\"nofollow\">Zapier.com</a>",
    "legacy": {
      "bookmark_count": 0,
      "bookmarked": false,
      "created_at": "Thu Jun 01 13:47:03 +0000 2023",
      "conversation_id_str": "1664267318053179398",
      "display_text_range": [0, 122],
      "entities": {
        "media": [
          {
            "display_url": "pic.twitter.com/zLjDlxdKee",
            "expanded_url": "https://twitter.com/Scrapfly_dev/status/1664267318053179398/photo/1",
            "id_str": "1664267314160607232",
            "indices": [123, 146],
            "media_url_https": "https://pbs.twimg.com/media/FxiqTffWIAALf7O.png",
            "type": "photo",
            "url": "https://t.co/zLjDlxdKee",
            "features": {
              "large": {"faces": []},
              "medium": {"faces": []},
              "small": {"faces": []},
              "orig": {"faces": []}
            },
            "sizes": {
              "large": {"h": 416, "w": 796, "resize": "fit"},
              "medium": {"h": 416, "w": 796, "resize": "fit"},
              "small": {"h": 355, "w": 680, "resize": "fit"},
              "thumb": {"h": 150, "w": 150, "resize": "crop"}
            },
            "original_info": {
              "height": 416,
              "width": 796,
              "focus_rects": [
                {"x": 27, "y": 0, "w": 743, "h": 416},
                {"x": 190, "y": 0, "w": 416, "h": 416},
                {"x": 216, "y": 0, "w": 365, "h": 416},
                {"x": 294, "y": 0, "w": 208, "h": 416},
                {"x": 0, "y": 0, "w": 796, "h": 416}
              ]
            }
          }
        ],
        "user_mentions": [],
        "urls": [
          {
            "display_url": "scrapfly.io/blog/top-10-we…",
            "expanded_url": "https://scrapfly.io/blog/top-10-web-scraping-libraries-in-python/",
            "url": "https://t.co/d2iFdAV2LJ",
            "indices": [99, 122]
          }
        ],
        "hashtags": [],
        "symbols": []
      },
      "extended_entities": {
        "media": [
          {
            "display_url": "pic.twitter.com/zLjDlxdKee",
            "expanded_url": "https://twitter.com/Scrapfly_dev/status/1664267318053179398/photo/1",
            "id_str": "1664267314160607232",
            "indices": [123, 146],
            "media_key": "3_1664267314160607232",
            "media_url_https": "https://pbs.twimg.com/media/FxiqTffWIAALf7O.png",
            "type": "photo",
            "url": "https://t.co/zLjDlxdKee",
            "ext_media_availability": {"status": "Available"},
            "features": {
              "large": {"faces": []},
              "medium": {"faces": []},
              "small": {"faces": []},
              "orig": {"faces": []}
            },
            "sizes": {
              "large": {"h": 416, "w": 796, "resize": "fit"},
              "medium": {"h": 416, "w": 796, "resize": "fit"},
              "small": {"h": 355, "w": 680, "resize": "fit"},
              "thumb": {"h": 150, "w": 150, "resize": "crop"}
            },
            "original_info": {
              "height": 416,
              "width": 796,
              "focus_rects": [
                {"x": 27, "y": 0, "w": 743, "h": 416},
                {"x": 190, "y": 0, "w": 416, "h": 416},
                {"x": 216, "y": 0, "w": 365, "h": 416},
                {"x": 294, "y": 0, "w": 208, "h": 416},
                {"x": 0, "y": 0, "w": 796, "h": 416}
              ]
            }
          }
        ]
      },
      "favorite_count": 0,
      "favorited": false,
      "full_text": "A new blog post has been published! \n\nTop 10 Web Scraping Packages for Python \ud83e\udd16\n\nCheckout it out \ud83d\udc47\nhttps://t.co/d2iFdAV2LJ https://t.co/zLjDlxdKee",
      "is_quote_status": false,
      "lang": "en",
      "possibly_sensitive": false,
      "possibly_sensitive_editable": true,
      "quote_count": 0,
      "reply_count": 0,
      "retweet_count": 0,
      "retweeted": false,
      "user_id_str": "1310623081300402178",
      "id_str": "1664267318053179398"
    },
    "quick_promote_eligibility": {
      "eligibility": "IneligibleUserUnauthorized"
    }
  },
  "replies": [],
  "other": []
}

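(4) Key Notes

We load the tweet page in a headless browser, capture all background requests, and then keep only those that contain tweet data. Note that we must wait for the page to finish loading (signalled by the tweet appearing in the page HTML); otherwise the scraper returns before the background requests complete and the data will be missing or incomplete.

The raw JSON dataset is typically large and deeply nested, which is why the parse_tweet function in the code above reduces it with jmespath. The next section explains what that function does.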
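The Playwright scraper above waits for the tweet element and then searches everything it captured. An alternative way to make the waiting requirement explicit is Playwright's `page.expect_response`, which blocks until a response matching a predicate arrives. The sketch below is an assumption-laden variation, not the tutorial's own method: the helper name `fetch_tweet_json` and the 30 second default timeout are hypothetical, and `page` is assumed to be a Playwright sync-API page.

```python
from typing import Any, Dict


def fetch_tweet_json(page: Any, url: str, timeout_ms: float = 30_000) -> Dict:
    """Navigate to `url` and block until the tweet-detail background response arrives.

    `page` is expected to be a Playwright sync-API Page object.
    """
    def is_tweet_call(response) -> bool:
        # same URL marker the scrapers above use to spot the tweet request
        return "TweetResultByRestId" in response.url

    # expect_response() starts listening before the navigation triggers the request
    with page.expect_response(is_tweet_call, timeout=timeout_ms) as response_info:
        page.goto(url)
    return response_info.value.json()
```

Inside the `with sync_playwright() as pw:` block, this could be called as `data = fetch_tweet_json(page, url)` and the result passed on as `parse_tweet(data['data']['tweetResult']['result'])`. If no matching response arrives within the timeout, Playwright raises a timeout error instead of silently returning nothing.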

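VII. Parsing the Tweet Dataset

The scraped tweet dataset contains a lot of complex data. With jmespath's JSON reshaping we can simplify it by renaming key fields and flattening nested objects into a cleaner, more readable structure. The parsing function (parse_tweet) is already included in the scraping code above; it does the following:

Extracts core tweet information such as the creation time (created_at), text (text), and engagement counts (likes, retweets, and so on).
Collects attached content: attached links (attached_urls), media (attached_media), mentioned users (tagged_users), and hashtags (tagged_hashtags).
Parses poll data (if the tweet contains a poll), extracting the choices, end time, and related fields.
Attaches the tweet author's data, which can be parsed further with a parse_user function (to be written to suit your needs).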

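VIII. Scraping X.com User Profiles

Scraping X.com profile pages uses the same background request capture method, except that this time we capture the requests whose URL contains UserBy. The steps mirror post scraping: start a headless browser, enable background request capture, load the page, and pick out the data request.

The code below implements this with Playwright and with the Scrapfly SDK.

(1) Scraping with Playwright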

from playwright.sync_api import sync_playwright

def scrape_profile(url: str) -> dict:
    """
    Scrape X.com profile details, for example: https://x.com/Scrapfly_dev
    Returns the raw user object, or None if no user request was captured.
    """
    _xhr_calls = []

    def intercept_response(response):
        """Capture every background (XHR) response and store it."""
        if response.request.resource_type == "xhr":
            _xhr_calls.append(response)

    with sync_playwright() as pw:
        # headless=False opens a visible browser window; pass headless=True to run without one
        browser = pw.chromium.launch(headless=False)
        context = browser.new_context(viewport={"width": 1920, "height": 1080})
        page = context.new_page()

        # enable background request interception
        page.on("response", intercept_response)
        # go to the URL and wait for the primary column to render
        page.goto(url)
        page.wait_for_selector("[data-testid='primaryColumn']")

        # keep only the background requests that carry user data
        user_calls = [f for f in _xhr_calls if "UserBy" in f.url]
        for xhr in user_calls:
            data = xhr.json()
            return data['data']['user']['result']

if __name__ == "__main__":
    print(scrape_profile("https://x.com/Scrapfly_dev"))

(2) Scraping with the Scrapfly SDK

import asyncio
import json
from scrapfly import ScrapeConfig, ScrapflyClient

# initialize the Scrapfly client; replace with your own Scrapfly key
SCRAPFLY = ScrapflyClient(key="YOUR SCRAPFLY KEY")

async def scrape_profile(url: str) -> dict:
    """
    Scrape X.com profile details, for example: https://x.com/Scrapfly_dev
    Returns the raw user object, or None if no user request was captured.
    """
    result = await SCRAPFLY.async_scrape(ScrapeConfig(
        url,
        render_js=True,  # enable the headless browser
        wait_for_selector="[data-testid='primaryColumn']"  # wait until the primary column appears
    ))
    # take the captured background requests and keep those that carry user data
    _xhr_calls = result.scrape_result["browser_data"]["xhr_call"]
    user_calls = [f for f in _xhr_calls if "UserBy" in f["url"]]
    for xhr in user_calls:
        if not xhr["response"]:
            continue
        data = json.loads(xhr["response"]["body"])
        return data['data']['user']['result']

if __name__ == "__main__":
    print(asyncio.run(scrape_profile("https://x.com/Scrapfly_dev")))

(3) Example Output

{
  "__typename": "User",
  "id": "VXNlcjoxMzEwNjIzMDgxMzAwNDAyMTc4",
  "rest_id": "1310623081300402178",
  "affiliates_highlighted_label": {},
  "is_blue_verified": true,
  "profile_image_shape": "Circle",
  "legacy": {
    "created_at": "Mon Sep 28 16:51:22 +0000 2020",
    "default_profile": true,
    "default_profile_image": false,
    "description": "Web Scraping API - turn any website into a database!\n\nScrapFly allows you to quickly achieve your data goals without web scraping challenges and errors.",
    "entities": {
      "description": {
        "urls": []
      },
      "url": {
        "urls": [
          {
            "display_url": "scrapfly.io",
            "expanded_url": "https://scrapfly.io",
            "url": "https://t.co/1Is3k6KzyM",
            "indices": [0, 23]
          }
        ]
      }
    },
    "fast_followers_count": 0,
    "favourites_count": 26,
    "followers_count": 163,
    "friends_count": 993,
    "has_custom_timelines": true,
    "is_translator": false,
    "listed_count": 2,
    "location": "Paris",
    "media_count": 11,
    "name": "Scrapfly",
    "normal_followers_count": 163,
    "pinned_tweet_ids_str": [],
    "possibly_sensitive": false,
    "profile_banner_url": "https://pbs.twimg.com/profile_banners/1310623081300402178/1601320645",
    "profile_image_url_https": "https://pbs.twimg.com/profile_images/1310658795715076098/XedZDwC7_normal.jpg",
    "profile_interstitial_type": "",
    "screen_name": "Scrapfly_dev",
    "statuses_count": 56,
    "translator_type": "none",
    "url": "https://t.co/1Is3k6KzyM",
    "verified": false,
    "withheld_in_countries": []
  },
  "business_account": {},
  "highlights_info": {
    "can_highlight_tweets": true,
    "highlighted_tweets": "0"
  },
  "creator_subscriptions_count": 0
}

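IX. Bypassing X.com Blocking with ScrapFly

When scraping X.com at scale, getting blocked is very likely. X.com does not allow automated requests, so the scraper's IP address may be blocked after only a few requests. This is where ScrapFly can help!

ScrapFly provides web scraping, screenshot, and data extraction APIs for large-scale data collection. Its core features include:

Anti-bot protection bypass: scrape web pages without being blocked.
Rotating residential proxies: avoid IP address and geographic blocks.
JavaScript rendering: scrape dynamic pages through cloud browsers.
Full browser automation: control the browser to scroll, type, and click.
Format conversion: return results as HTML, JSON, text, or Markdown.
Python and TypeScript SDKs, plus Scrapy and no-code tool integrations.

For example, ScrapFly can be used from Python through its Python SDK: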

from scrapfly import ScrapflyClient, ScrapeConfig

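# initialize the Scrapfly client; replace with your own Scrapfly key
scrapfly = ScrapflyClient(key="YOUR SCRAPFLY KEY")

# send a scrape request
result = scrapfly.scrape(ScrapeConfig(
    "https://twitter.com/Scrapfly_dev",
    # optional features:
    render_js=True,  # use a cloud headless browser
    asp=True,  # enable anti-scraping protection bypass
    screenshots={"all": "fullpage"},  # capture a full-page screenshot
    country="US",  # choose the proxy country (United States)
))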

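X. Scraping X.com Search Results, Replies, and Timelines

This tutorial has covered X.com posts and profiles, which are publicly accessible. Other areas such as search results and timelines are not public: they require a login, and scraping while logged in can get the account banned.

X.com offers public guest previews of timelines and tweet search only to Android devices, and this is the only way to scrape X.com timelines, tweet replies, and search results without logging in.

At present, the most reliable and actively updated resource on this is Nitter.net, an open-source alternative front end for X (formerly Twitter). For more details, follow the Nitter Guest Account Branch on GitHub.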

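XI. Frequently Asked Questions (FAQ)

To wrap up this Python X.com (formerly Twitter) scraper, here are answers to some common questions about scraping X.com:

1. Is it legal to scrape X.com?

Generally, yes. The posts and profiles covered in this tutorial are publicly available, so scraping them is generally considered legal. Note, however, that some tweets contain copyrighted material such as images or videos, and using that material commercially may be illegal.

2. How can I avoid being blocked when scraping X.com?

X.com is a complex, JavaScript-heavy website that actively resists web scraping, so scrapers are easily blocked. To avoid this, you can use ScrapFly, which provides anti-scraping protection bypass and proxy rotation; guides on avoiding scraper blocking cover further techniques.

3. Is it legal to scrape X.com while logged in?

The legality of scraping X.com while logged in is a grey area. Logging in generally binds the user to the site's terms of service, and X.com's terms prohibit automated scraping. Scraping while logged in can therefore lead to an account ban or even legal action, so it is best avoided whenever possible.

4. How can I reduce bandwidth usage and speed up X.com scraping?

With a browser automation tool such as Playwright, as used in this article, you can block images and other unnecessary resources from loading, which saves bandwidth and speeds up scraping.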
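As a rough sketch of that idea, Playwright lets a handler registered with `page.route` inspect each request and abort it. The handler name `block_heavy_resources` and the particular set of blocked resource types below are assumptions chosen for illustration, not a tested recommendation for X.com.

```python
from typing import Any

# Resource types that are safe to skip when only the background JSON is needed.
# "xhr" and "fetch" must NOT be listed here, or the data requests would be cancelled too.
BLOCKED_RESOURCE_TYPES = {"image", "media", "font"}


def block_heavy_resources(route: Any) -> None:
    """Playwright route handler: abort heavy requests and let everything else through."""
    if route.request.resource_type in BLOCKED_RESOURCE_TYPES:
        route.abort()
    else:
        route.continue_()


# Usage with the Playwright scrapers in this article, before page.goto(url):
#     page.route("**/*", block_heavy_resources)
```

Stylesheets and scripts are deliberately left alone here, since the scrapers wait for rendered elements such as `[data-testid='tweet']` and the page needs its JavaScript to issue the background requests.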

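XII. X.com Scraping Summary

In this tutorial, we built a simple X.com (formerly Twitter) scraper in Python using the headless browser capabilities of Playwright or the Scrapfly SDK.

We first analyzed how X.com works to find where the data lives, and saw that X.com fetches and fills in post and profile data through background requests. We then captured those requests with the interception features of Playwright or the Scrapfly SDK, and used jmespath to reduce the raw datasets to clean, readable JSON.

Finally, to avoid being blocked while scraping, we introduced the ScrapFly web scraping API, whose proxies and anti-scraping protection bypass support scraping X.com at scale. ScrapFly offers a free trial, so give it a try!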