Selenium Web Scraping Java

Web Scraping 101 (Using Selenium for Java) – Gal Abramovitz



Photo by rawpixel on Unsplash

Web Scraping is one of the most useful skills in today’s digital world. Basically, it takes web browsing to the next level by automating everyday actions, such as opening URLs, reading text and data, and clicking links. Although web scraping is relatively easy to learn and execute, it’s a powerful tool that you can use to collect existing data from websites, then easily manipulate, analyze and store it to your liking; or to automate workflows in web-based services.

In this tutorial I’ll show you step-by-step how to set up Selenium in IntelliJ and how to build a basic web scraper that can read data from a webpage. The stages are followed by matching screenshots.

Setting up Selenium in IntelliJ

Stage 1: I found it easiest to use Selenium Standalone Server. Download the JAR file.
Stage 2: Open IntelliJ IDEA and create a new project.
Stage 3: Right-click one of your project’s directories and click Open Module Settings.
Stage 4: In the Modules section click the Dependencies tab, then click the “+” button and choose JARs or directories…
Stage 5: Choose the JAR file you’ve downloaded in stage 1, then click OK.
Stage 6: The External Libraries section should now contain the JAR file.
Stage 7: Download Geckodriver (choose the file according to your operating system) and put the executable file in the project’s directory.

That’s it – you’re all set! You can take a look at the self-explanatory screenshots and move on to the second part of the tutorial.

Screenshot captions – Stage 3: click Open Module Settings. Stage 4: in the Modules section click the Dependencies tab, then click the “+” button and choose JARs or directories… Stage 6: the External Libraries should now contain the JAR file. Stage 7: your project’s directory should look like this.

Building a basic web scraper

After we’ve set up our working environment, let’s use Selenium to build a basic web scraper. We’ll use the HTML structure of a webpage and read specific data from it (we’ll actually take advantage of the CSS code, but the principles are the same).

Preparing for web scraping

Before we actually start to scrape, we need to understand what we’re looking for.
This part will be much easier for you if you can read HTML code. In this tutorial we’ll scrape my website and extract the list of languages I use. It might seem a tough task at first, but a glance into the paragraph’s HTML code will reveal a surprising structure:



“In my work and personal projects I use Java, JavaScript, jQuery, HTML and CSS. My university projects also use C, C++, Assembly, SQL and Python. As I’m highly aware of the elegance of my products, I keep learning and pushing myself towards cleaner code and more beautiful UI.”

You can tell immediately that each language name is wrapped in a separate <p> tag. We’ll use this fact later in our scraper’s code. I obviously already know my own website, but unravelling each page’s code is the first challenge that you’ll need to solve. In most cases there is an obvious logic to the HTML structure that we can take advantage of. Understand the HTML logic – and your life will get much easier.

The next step would be to plan the scraping workflow. I suggest you go through your workflow manually a couple of times before writing your web scraper. In this basic example, this is the workflow we’re going to implement:

  • Open a new browser window.
  • Navigate to my website.
  • Click the menu option “ABOUT”.
  • Read the languages names, according to the logic we’ve found.

Selenium Building Blocks

I like to think about a (basic) web scraper as a synchronous program that mainly uses two object instances:

  • A WebDriver instance, which is our code’s connection to the browser’s window: it allows us to read its code and simulate user actions. Each WebDriver instance corresponds to an individual browser window. (In this tutorial I’ll use a FirefoxDriver, just because I usually use Chrome and I like to have a separate browser for my web scraping applications.)
  • A WebDriverWait instance, which allows us to synchronize our actions (e.g. click a button only after it loads and is clickable).

This might seem a bit abstract, but don’t worry – in a moment we’ll use both of them in practice.

Finally – let’s write some code! (The full code is provided in this repository.) We’ll start by initializing a FirefoxDriver and a WebDriverWait:

FirefoxDriver driver = new FirefoxDriver();
WebDriverWait wait = new WebDriverWait(driver, 30);

Note that WebDriverWait remains the same, regardless of the browser you choose, and that it’s connected to the FirefoxDriver instance. The second argument in its constructor represents the maximum amount of time that our program should wait for an action (e.g. a page load).

After initialization, we have one new Firefox window that’s controlled by our program. Now let’s navigate to a specific URL:

driver.navigate().to("…");

Now we want to click an element.
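Before we pick an address for the element, it’s worth getting a feel for XPath, the addressing scheme used from here on. Here is a standalone sketch using only the JDK’s built-in javax.xml.xpath on a made-up snippet; the markup, class name and method names below are my own illustration (only the id "page1" is taken from the article), not the site’s real code:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class XPathDemo {
    // Made-up markup mimicking the languages paragraph: one <p> per language.
    static final String SNIPPET =
        "<div id=\"page1\"><div><p>Java</p><p>JavaScript</p><p>jQuery</p></div></div>";

    static Document parse() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(SNIPPET.getBytes(StandardCharsets.UTF_8)));
    }

    // An XPath that ends with an index addresses exactly one element.
    static String firstLanguage() throws Exception {
        XPath xp = XPathFactory.newInstance().newXPath();
        return (String) xp.evaluate("//*[@id='page1']/div/p[1]", parse(), XPathConstants.STRING);
    }

    // Without the index, the same path addresses ALL matching elements --
    // the behaviour Selenium's findElements() relies on later.
    static int languageCount() throws Exception {
        XPath xp = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xp.evaluate("//*[@id='page1']/div/p", parse(), XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(firstLanguage()); // Java
        System.out.println(languageCount()); // 3
    }
}
```

The same two forms of XPath (with and without a trailing index) are exactly what we’ll feed to Selenium below.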
We’ll need to specify a one-to-one address that’ll point to this exact element. There are a few kinds of addresses we can use, but I find XPath to be the most convenient. In order to find the button’s XPath I’m going to use Chrome’s Developer Tools: right-click the element, then click Inspect. The page’s HTML code will be opened and the inspected element will be highlighted. Right-click it, click Copy and then Copy XPath. Now you have the element’s XPath that can be used as its address! Let’s save it as a String:

String aboutButtonXpath = "//*[@id=\"about\"]/div/a";

The code is executed regardless of the browser’s state, so we need to make sure the button element is loaded before we try to click it. As mentioned, we can use our wait instance just for that (one of the reasons I love Selenium is because the syntax is self-explanatory):

wait.until(ExpectedConditions.elementToBeClickable(By.xpath(aboutButtonXpath)));

Our WebDriverWait instance blocks our thread until the button element is clickable. Hence, right after it we can click it:

driver.findElement(By.xpath(aboutButtonXpath)).click();

Note how the click is called via the same findElement() method. Similarly, we make sure the relevant paragraph is loaded before trying to copy its data:

String languagesParagraphXpath = "//*[@id=\"page1\"]/div[2]/div[5]";
wait.until(ExpectedConditions.visibilityOfElementLocated(By.xpath(languagesParagraphXpath)));

Now we can use the logic that we’ve defined in the workflow. Using the driver.findElements() method, we’ll read the list of elements,


each representing a <p> tag that contains a language name:

List<WebElement> languageNamesList = driver.findElements(By.xpath("//*[@id=\"page1\"]/div[2]/div[5]/p"));

Note that the last level in the XPath’s hierarchy doesn’t have an index. That’s because this XPath points to the array of all elements that satisfy this HTML tags hierarchy. Another detail worth mentioning is that the method used here is findElements (in plural form),

which suggests that the expected return value is a list.

We’re finished scraping, so we can close the browser window:

driver.close();

Lastly, I chose to just print the language names. However, you can do anything you want with them, as they’re now accessible from within your code!

That’s it. I hope I was able to give you a basic understanding of Selenium for Java.
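Once the names are plain Java strings, ordinary collection code takes over. Here is a small sketch of two things you might do with them; the list below is hard-coded sample data standing in for the scraped texts (in the real scraper you’d obtain the strings by calling getText() on each WebElement), and the class and method names are my own:

```java
import java.util.List;
import java.util.stream.Collectors;

public class Languages {
    // Stand-in for the scraped results; in the real scraper this would be
    // built from languageNamesList via WebElement.getText().
    static final List<String> SCRAPED = List.of("Java", "JavaScript", "jQuery", "HTML", "CSS");

    // Join the names into a single comma-separated line for printing.
    static String joined() {
        return SCRAPED.stream().collect(Collectors.joining(", "));
    }

    // Or produce an alphabetically sorted copy for further processing.
    static List<String> sortedNames() {
        return SCRAPED.stream().sorted().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(joined());      // Java, JavaScript, jQuery, HTML, CSS
        System.out.println(sortedNames()); // [CSS, HTML, Java, JavaScript, jQuery]
    }
}
```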
