Alex-Programer

Alex-Programer

随缘博客,不定期更新不确定的内容~
github
twitter

The simplest crawler

Background#

I have never written a web crawler in my work before. Recently, my supervisor gave me a task to crawl the historical price information of two tokens from birdeye. As a frontend developer, my first thought was that all requests are made from the webpage. So, I should be able to write the request code in the console of this website, and then send the data to my local API after obtaining it.

Obtain the original request information#

image

The Google Chrome console provides a convenient feature to quickly copy a request.

The reason for writing the script in the browser console is that when writing the script locally to make the request, there are many additional considerations.
For example, analyzing what request headers and request methods need to be included in the request, the cost of trial and error is high. In this case, simply copy a fetch request and run it directly in the browser console. It is highly likely to work, and everything that should be included will be included, allowing me to focus on data processing.

Attempt to run, it works.

image

Write the script#

// The token to be crawled
const token = "xxx";
// Start time
const startDate = new Date("2023-01-01");
// End time
const endDate = new Date("2023-01-02");
// Interval of 30 minutes
const type = "30m";

(async () => {
  const getToken = (start, end) => {
    return new Promise((resolve) => {
      fetch(
        `https://api.birdeye.so/amm/ohlcv?address=${token}&currency=usd&type=${type}&time_from=${start}&time_to=${end}`,
        {
          headers: {
            accept: "application/json, text/plain, */*",
            "accept-language": "zh-CN",
            "agent-id": "93bd6afd-cbf4-4acc-a014-b62d88374ea6",
            "cf-be":
              "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpYXQiOjE2Nzk0Njk3ODIsImV4cCI6MTY3OTQ3MDA4Mn0.eWkKWRUsPKMZCqNJB5KL_Z_Eurn4wCMN5dyzCR5oJ_U",
            "sec-ch-ua": '"Not;A=Brand";v="99", "Chromium";v="106"',
            "sec-ch-ua-mobile": "?0",
            "sec-ch-ua-platform": '"macOS"',
            "sec-fetch-dest": "empty",
            "sec-fetch-mode": "cors",
            "sec-fetch-site": "same-site",
            "x-cypress-is-xhr-or-fetch": "true",
            Referer: "https://birdeye.so/",
            "Referrer-Policy": "strict-origin-when-cross-origin",
          },
          body: null,
          method: "GET",
        }
      )
        .then((res) => res.json())
        .then((res) => {
          try {
            resolve(res.data.items);
          } catch (error) {
            console.error(error);
          }
        });
    });
  };

  for (
    let currentDate = startDate;
    currentDate <= endDate;
    currentDate.setDate(currentDate.getDate() + 1)
  ) {
    
    const startTimestamp = Math.floor(currentDate.getTime() / 1000);
    const endTimestamp = startTimestamp + 86399;

    try {
      console.log("Start reading data...", [startTimestamp, endTimestamp]);
      
      // Reduce the pressure on the server
      await new Promise((resolve) => {
        setTimeout(resolve, 1000);
      });
      
      const data = await getToken(startTimestamp, endTimestamp);
      
      console.log(
        `${new Date(startTimestamp * 1000).toLocaleString()} - ${new Date(
          endTimestamp * 1000
        ).toLocaleString()} Data reading completed`
      );

      // Send the data to my local API service
      fetch("http://localhost:3000", {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
        },
        body: JSON.stringify(data),
      });
    } catch (error) {
      console.log(error, "Failed to read data");
      // This is crucial. Printing this log is to consider the case where the API is limited in the number of requests. After a failure, adjust the startDate at the top based on the information printed here, and continue crawling data after a while.
      console.log(startTimestamp, endTimestamp);
      break;
    }
  }
})();

Local API service#

import express from "express";
import bodyParser from "body-parser";
import cors from "cors";

const origin = "https://birdeye.so";
const filePath = "./data.json";

const app = express();
app.use(bodyParser.json());
app.use(cors({ origin }));

app.post("/", async (req, res) => {
  // Handle the data format yourself
  console.log(req.body);

  res.send("ok");
});

app.listen(3000);

Summary#

At first, I wanted to use Cypress to implement this by injecting the script, which would be more friendly for engineering purposes. However, this feature is temporary after all, and there are no additional requirements later. So I implemented it simply; fortunately, the Google Chrome console provides very convenient tools. Otherwise, it would be quite troublesome to check the request headers and cookies one by one.

Loading...
Ownership of this post data is guaranteed by blockchain and smart contracts to the creator alone.